You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "leixm (via GitHub)" <gi...@apache.org> on 2024/03/29 03:30:23 UTC

[PR] [SPARK-47639] Support codegen for json_tuple. [spark]

leixm opened a new pull request, #45765:
URL: https://github.com/apache/spark/pull/45765

   ### What changes were proposed in this pull request?
   Support codegen for json_tuple.
   
   
   ### Why are the changes needed?
   Sometimes using json_tuple may cause performance regression because it does not support whole stage codegen..
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Existing UTs.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on PR #45765:
URL: https://github.com/apache/spark/pull/45765#issuecomment-2049416811

   @leixm Sorry, I've been busy with internal matters at the company recently, so it might take me a while to focus on this PR.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "leixm (via GitHub)" <gi...@apache.org>.

leixm commented on PR #45765:
URL: https://github.com/apache/spark/pull/45765#issuecomment-2031524457

   Seems flaky test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45765:
URL: https://github.com/apache/spark/pull/45765#discussion_r1561053712


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala:
##########
@@ -501,55 +503,156 @@ case class JsonTuple(children: Seq[Expression])
       return nullRow
     }
 
+    val fieldNames = if (constantFields == fieldExpressions.length) {

Review Comment:
   one idea to simplify the implementation: I think "all constant field names" is the most common case, so we should optimize for it. Mixed case is rather rare and we should just treat it as "no constant field name" to simplify things.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45765:
URL: https://github.com/apache/spark/pull/45765#discussion_r1555471908


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala:
##########
@@ -501,55 +503,156 @@ case class JsonTuple(children: Seq[Expression])
       return nullRow
     }
 
+    val fieldNames = if (constantFields == fieldExpressions.length) {
+      // typically the user will provide the field names as foldable expressions
+      // so we can use the cached copy
+      foldableFieldNames.map(_.orNull)
+    } else if (constantFields == 0) {
+      // none are foldable so all field names need to be evaluated from the input row
+      fieldExpressions.zipWithIndex.map {
+        case (expr, index) =>
+          Option(expr.eval(input)).map {
+            path =>
+              JsonPathIndex(index, path.asInstanceOf[UTF8String].toString)
+          }.orNull
+      }
+    } else {
+      // if there is a mix of constant and non-constant expressions
+      // prefer the cached copy when available
+      foldableFieldNames.zip(fieldExpressions).zipWithIndex.map {
+        case ((null, expr), index) =>
+          Option(expr.eval(input)).map {
+            path =>
+              JsonPathIndex(index, path.asInstanceOf[UTF8String].toString)
+          }.orNull
+        case ((fieldName, _), _) => fieldName.orNull
+      }
+    }
+
+    val evaluator = new JsonTupleEvaluator(
+      json,
+      fieldNames
+        .filter{jsonPathIndex => jsonPathIndex != null && jsonPathIndex.path != null }
+        .asJava,
+      fieldExpressions.length)
+    evaluator.evaluate()
+  }
+
+  override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): JsonTuple =
+    copy(children = newChildren)
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {

Review Comment:
   Haven't looked into it yet, but is it possible to make codegen simpler and write most of the code in Scala?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #45765:
URL: https://github.com/apache/spark/pull/45765#discussion_r1547512878


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala:
##########
@@ -272,11 +272,6 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with
     assert(jt.eval(null).iterator.to(Seq).head === expected)
   }
 
-  test("json_tuple escaping") {

Review Comment:
   cc @MaxGekk 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "leixm (via GitHub)" <gi...@apache.org>.

leixm commented on PR #45765:
URL: https://github.com/apache/spark/pull/45765#issuecomment-2042017037

   @LuciferYang  @MaxGekk  PTAL.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on PR #45765:
URL: https://github.com/apache/spark/pull/45765#issuecomment-2029332226

   @leixm 
   There are some test failed like
   ```
   [info] - json_tuple escaping *** FAILED *** (10 milliseconds)
   [info]   java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 48, Column 30: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 48, Column 30: Assignment conversion not possible from type "scala.collection.IterableOnce" to type "org.apache.spark.sql.catalyst.util.ArrayData"
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on PR #45765:
URL: https://github.com/apache/spark/pull/45765#issuecomment-2026577654

   What is the newly added `TestCodeGen.java` used for?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "leixm (via GitHub)" <gi...@apache.org>.

leixm commented on PR #45765:
URL: https://github.com/apache/spark/pull/45765#issuecomment-2031015857

   > I haven't looked at the code in detail yet, but I have two questions first:
   > 
   > 1. After this PR, which test cases still cover the non-codegen code branches? The test cases related to `json_tuple`in `org.apache.spark.sql.JsonFunctionsSuite` seem to have all been changed to cover the codegen branch.
   > 2. Can you add a test branch for `json_tuple` in `JsonBenchmark` to compare codegen on and codegen off and update the benchmark results? Just like what `get_json_object` did. (The benchmark result can be obtained by running benchmark.yml with GA.)
   > 
   > https://github.com/apache/spark/blob/a8b247e9a50ae0450360e76bc69b2c6cdf5ea6f8/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala#L270C24-L280
   > 
   > ![image](https://private-user-images.githubusercontent.com/1475305/317965181-6916a9cb-ddd8-44f6-90e4-01e14b925a78.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTIwMjk1MTUsIm5iZiI6MTcxMjAyOTIxNSwicGF0aCI6Ii8xNDc1MzA1LzMxNzk2NTE4MS02OTE2YTljYi1kZGQ4LTQ0ZjYtOTBlNC0wMWUxNGI5MjVhNzgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDQwMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA0MDJUMDM0MDE1WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9YzMzYTdjOGUwZWJjNWU0Y2JhNDMxYmJjMjRiZDAyNmJkYzA2NTU2YTU2YmQ3OWUzMTViMGUwZGYzMmQxMTkzYSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.CdtHAojG1a7tpBtc5dVUZPFoVHScivHDnPimAHV-8m8)
   
   Sure, i have added codegen disable case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "leixm (via GitHub)" <gi...@apache.org>.

leixm commented on code in PR #45765:
URL: https://github.com/apache/spark/pull/45765#discussion_r1555489388


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala:
##########
@@ -501,55 +503,156 @@ case class JsonTuple(children: Seq[Expression])
       return nullRow
     }
 
+    val fieldNames = if (constantFields == fieldExpressions.length) {
+      // typically the user will provide the field names as foldable expressions
+      // so we can use the cached copy
+      foldableFieldNames.map(_.orNull)
+    } else if (constantFields == 0) {
+      // none are foldable so all field names need to be evaluated from the input row
+      fieldExpressions.zipWithIndex.map {
+        case (expr, index) =>
+          Option(expr.eval(input)).map {
+            path =>
+              JsonPathIndex(index, path.asInstanceOf[UTF8String].toString)
+          }.orNull
+      }
+    } else {
+      // if there is a mix of constant and non-constant expressions
+      // prefer the cached copy when available
+      foldableFieldNames.zip(fieldExpressions).zipWithIndex.map {
+        case ((null, expr), index) =>
+          Option(expr.eval(input)).map {
+            path =>
+              JsonPathIndex(index, path.asInstanceOf[UTF8String].toString)
+          }.orNull
+        case ((fieldName, _), _) => fieldName.orNull
+      }
+    }
+
+    val evaluator = new JsonTupleEvaluator(
+      json,
+      fieldNames
+        .filter{jsonPathIndex => jsonPathIndex != null && jsonPathIndex.path != null }
+        .asJava,
+      fieldExpressions.length)
+    evaluator.evaluate()
+  }
+
+  override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): JsonTuple =
+    copy(children = newChildren)
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {

Review Comment:
   Because we have to consider calculating the foldable expr in advance, which is the reason why doGenCode is bloated. I have tried to simplify the codegen code as much as possible. Do you have any good suggestions?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #45765:
URL: https://github.com/apache/spark/pull/45765#discussion_r1547511307


##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
@@ -3,128 +3,129 @@ Benchmark for performance of JSON parsing
 ================================================================================================
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 17.0.9+0 on Mac OS X 12.6.7

Review Comment:
   Please use GitHub Action's machine to generate this file, and also update the result file for Java 21



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "leixm (via GitHub)" <gi...@apache.org>.

leixm commented on PR #45765:
URL: https://github.com/apache/spark/pull/45765#issuecomment-2031020720

   Benchmark result:
   ```
   [info] JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] Text read                                            66             71           5         15.1          66.1       1.0X
   [info] from_json                                          1205           1226          22          0.8        1205.4       0.1X
   [info] json_tuple wholestage off                          1562           1604          36          0.6        1562.1       0.0X
   [info] json_tuple wholestage on                           1334           1348          12          0.7        1333.9       0.0X
   [info] get_json_object wholestage off                     1198           1230          35          0.8        1198.5       0.1X
   [info] get_json_object wholestage on                      1217           1238          25          0.8        1216.5       0.1X
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "leixm (via GitHub)" <gi...@apache.org>.

leixm commented on PR #45765:
URL: https://github.com/apache/spark/pull/45765#issuecomment-2031020085

   I deleted below code, this ut will cause an error(Assignment conversion not possible from type "scala.collection.IterableOnce" to type "org.apache.spark.sql.catalyst.util.ArrayData"), because `GenerateExec` will generate code through `codeGenIterableOnce` in normal scene, and the type of ev.value is IterableOnce.
   ```
   test("json_tuple escaping") {
       GenerateUnsafeProjection.generate(
         JsonTuple(Literal("\"quote") ::  Literal("\"quote") :: Nil) :: Nil)
     }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on PR #45765:
URL: https://github.com/apache/spark/pull/45765#issuecomment-2026699903

   I haven't looked at the code in detail yet, but I have two questions first:
   
   1. After this PR, which test cases still cover the non-codegen code branches? The test cases related to `json_tuple`in `org.apache.spark.sql.JsonFunctionsSuite` seem to have all been changed to cover the codegen branch.
   
   2. Can you add a test branch for `json_tuple` in `JsonBenchmark` to compare codegen on and codegen off and update the test results? Just like what `get_json_object` did. (The test result can be obtained by running benchmark.yml with GA.)
   
   https://github.com/apache/spark/blob/a8b247e9a50ae0450360e76bc69b2c6cdf5ea6f8/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala#L270C24-L280


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "leixm (via GitHub)" <gi...@apache.org>.

leixm commented on PR #45765:
URL: https://github.com/apache/spark/pull/45765#issuecomment-2026573777

   @LuciferYang  @cloud-fan  Can you help review plz?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "leixm (via GitHub)" <gi...@apache.org>.

leixm commented on PR #45765:
URL: https://github.com/apache/spark/pull/45765#issuecomment-2026580609

   > What is the newly added `TestCodeGen.java` used for?
   
   Sorry, i have deleted it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "leixm (via GitHub)" <gi...@apache.org>.

leixm commented on code in PR #45765:
URL: https://github.com/apache/spark/pull/45765#discussion_r1547726290


##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
@@ -3,128 +3,129 @@ Benchmark for performance of JSON parsing
 ================================================================================================
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 17.0.9+0 on Mac OS X 12.6.7

Review Comment:
   done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45765:
URL: https://github.com/apache/spark/pull/45765#discussion_r1561057212


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala:
##########
@@ -501,55 +503,156 @@ case class JsonTuple(children: Seq[Expression])
       return nullRow
     }
 
+    val fieldNames = if (constantFields == fieldExpressions.length) {
+      // typically the user will provide the field names as foldable expressions
+      // so we can use the cached copy
+      foldableFieldNames.map(_.orNull)
+    } else if (constantFields == 0) {
+      // none are foldable so all field names need to be evaluated from the input row
+      fieldExpressions.zipWithIndex.map {
+        case (expr, index) =>
+          Option(expr.eval(input)).map {
+            path =>
+              JsonPathIndex(index, path.asInstanceOf[UTF8String].toString)
+          }.orNull
+      }
+    } else {
+      // if there is a mix of constant and non-constant expressions
+      // prefer the cached copy when available
+      foldableFieldNames.zip(fieldExpressions).zipWithIndex.map {
+        case ((null, expr), index) =>
+          Option(expr.eval(input)).map {
+            path =>
+              JsonPathIndex(index, path.asInstanceOf[UTF8String].toString)
+          }.orNull
+        case ((fieldName, _), _) => fieldName.orNull
+      }
+    }
+
+    val evaluator = new JsonTupleEvaluator(
+      json,
+      fieldNames
+        .filter{jsonPathIndex => jsonPathIndex != null && jsonPathIndex.path != null }
+        .asJava,
+      fieldExpressions.length)
+    evaluator.evaluate()
+  }
+
+  override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): JsonTuple =
+    copy(children = newChildren)
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val evaluatorClass = classOf[JsonTupleEvaluator].getName
+    val fieldNames = ctx.freshName("fieldNames")
+    val jsonEval = jsonExpr.genCode(ctx)
+
+    val parseCode = if (constantFields == fieldExpressions.length) {
+      val cacheName = ctx.freshName("foldableFieldNames")
+      val ref = ctx.addReferenceObj(cacheName, foldableFieldNames.collect {
+        case Some(v) => v
+      }.asJava)
+      Seq(s"""$fieldNames = $ref;""")
+    } else if (constantFields == 0) {
+      // none are foldable so all field names need to be evaluated from the input row
+      fieldExpressions.zipWithIndex.map {
+        case (e, i) =>
+          val eval = e.genCode(ctx)
+          s"""
+             |${eval.code}
+             |if (!${eval.isNull}) {
+             |  $fieldNames.add(new JsonPathIndex($i, ${eval.value}.toString()));
+             |}
+           """.stripMargin
+      }
+    } else {
+      val cacheName = ctx.freshName("foldableFieldNames")
+      val ref =
+        ctx.addReferenceObj(cacheName, foldableFieldNames.collect {
+          case Some(v) => v
+        }.toBuffer.asJava)
+      val codeList = ArrayBuffer(s"""$fieldNames = $ref;""")
+      // if there is a mix of constant and non-constant expressions
+      // prefer the cached copy when available
+      foldableFieldNames.zip(fieldExpressions).zipWithIndex.foreach {
+        case ((null, e), i) =>
+          val eval = e.genCode(ctx)
+          codeList +=
+            s"""
+               |${eval.code}
+               |if (!${eval.isNull}) {
+               |  $fieldNames.add(new JsonPathIndex($i, ${eval.value}.toString()));
+               |}
+           """.stripMargin
+        case ((_, _), _) =>
+      }
+      codeList
+    }
+
+    val splitParseCode = ctx.splitExpressionsWithCurrentInputs(

Review Comment:
   We don't need to split the method if we don't optimize the mixed case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org