You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "sadikovi (via GitHub)" <gi...@apache.org> on 2023/08/25 02:26:23 UTC

[GitHub] [spark] sadikovi opened a new pull request, #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

sadikovi opened a new pull request, #42667:
URL: https://github.com/apache/spark/pull/42667

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   The PR improves JSON parsing when `spark.sql.json.enablePartialResults` is enabled: 
   - Fixes the issue when using nested arrays `ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow`
   - Improves parsing of the nested struct fields, e.g. `{"a1": "AAA", "a2": [{"f1": "", "f2": ""}], "a3": "id1", "a4": "XXX"}` used to be parsed as `|AAA|NULL    |NULL|NULL|` and now is parsed as `|AAA|[{NULL, }]|id1|XXX|`.
   - Improves performance of nested JSON parsing. The initial implementation would throw too many exceptions when multiple nested fields failed to parse. When the config is disabled, it is not a problem because the entire record is marked as NULL.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   Fixes some corner cases in JSON parsing and improves performance when `spark.sql.json.enablePartialResults` is enabled.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   I added tests to verify nested structs, maps, and arrays can be parsed without affecting the subsequent fields in the JSON. I also updated the existing tests when `spark.sql.json.enablePartialResults` is enabled because we parse more data now.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1315206808


##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
@@ -3,121 +3,125 @@ Benchmark for performance of JSON parsing
 ================================================================================================
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 JSON schema inferring:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        3720           3843         121          1.3         743.9       1.0X
-UTF-8 is set                                       5412           5455          45          0.9        1082.4       0.7X
+No encoding                                        2084           2134          46          2.4         416.8       1.0X
+UTF-8 is set                                       3077           3093          14          1.6         615.3       0.7X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 count a short column:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        3234           3254          33          1.5         646.7       1.0X
-UTF-8 is set                                       4847           4868          21          1.0         969.5       0.7X
+No encoding                                        2854           2863           8          1.8         570.8       1.0X
+UTF-8 is set                                       4066           4066           1          1.2         813.1       0.7X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 count a wide column:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        5702           5794         101          0.2        5702.1       1.0X
-UTF-8 is set                                       9526           9607          73          0.1        9526.1       0.6X
+No encoding                                        3348           3368          26          0.3        3347.8       1.0X
+UTF-8 is set                                       5215           5239          22          0.2        5214.7       0.6X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 select wide row:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                       18318          18448         199          0.0      366367.7       1.0X
-UTF-8 is set                                      19791          19887          99          0.0      395817.1       0.9X
+No encoding                                       11046          11102          54          0.0      220928.4       1.0X
+UTF-8 is set                                      12135          12181          54          0.0      242697.4       0.9X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 Select a subset of 10 columns:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 10 columns                                  2531           2570          51          0.4        2531.3       1.0X
-Select 1 column                                    1867           1882          16          0.5        1867.0       1.4X
+Select 10 columns                                  2486           2488           2          0.4        2486.5       1.0X
+Select 1 column                                    1505           1506           2          0.7        1504.6       1.7X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 creation of JSON parser per line:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Short column without encoding                       868            875           7          1.2         868.4       1.0X
-Short column with UTF-8                            1151           1163          11          0.9        1150.9       0.8X
-Wide column without encoding                      12063          12299         205          0.1       12063.0       0.1X
-Wide column with UTF-8                            16095          16136          51          0.1       16095.3       0.1X
+Short column without encoding                       888            889           3          1.1         887.6       1.0X
+Short column with UTF-8                            1134           1136           2          0.9        1134.3       0.8X
+Wide column without encoding                       8012           8056          51          0.1        8012.4       0.1X
+Wide column with UTF-8                             9830           9844          22          0.1        9829.7       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           165            170           4          6.1         164.7       1.0X
-from_json                                          2339           2386          77          0.4        2338.9       0.1X
-json_tuple                                         2667           2730          55          0.4        2667.3       0.1X
-get_json_object                                    2627           2659          32          0.4        2627.1       0.1X
+Text read                                            85             87           2         11.7          85.4       1.0X
+from_json                                          1706           1711           4          0.6        1706.4       0.1X
+json_tuple                                         1528           1534           7          0.7        1528.2       0.1X
+get_json_object                                    1275           1286          17          0.8        1275.0       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 Dataset of json strings:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           700            715          20          7.1         140.1       1.0X
-schema inferring                                   3144           3166          20          1.6         628.7       0.2X
-parsing                                            3261           3271           9          1.5         652.1       0.2X
+Text read                                           369            370           1         13.6          73.8       1.0X
+schema inferring                                   1880           1883           4          2.7         376.0       0.2X
+parsing                                            3731           3737           8          1.3         746.1       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                          1096           1105          12          4.6         219.1       1.0X
-Schema inferring                                   3818           3830          16          1.3         763.6       0.3X
-Parsing without charset                            4107           4137          32          1.2         821.4       0.3X
-Parsing with UTF-8                                 5717           5763          41          0.9        1143.3       0.2X
+Text read                                           553            579          32          9.0         110.6       1.0X
+Schema inferring                                   2195           2196           2          2.3         439.0       0.3X
+Parsing without charset                            4272           4274           3          1.2         854.3       0.1X

Review Comment:
   Given the ration between `Test read` and `Parsing without charset `, is there a chance of regression in this PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1308683062


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -420,17 +420,17 @@ class JacksonParser(
     case VALUE_STRING if parser.getTextLength < 1 && allowEmptyString =>
       dataType match {
         case FloatType | DoubleType | TimestampType | DateType =>
-          throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+          throw EmptyJsonFieldValueException(dataType)
         case _ => null
       }
 
     case VALUE_STRING if parser.getTextLength < 1 =>
-      throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+      throw EmptyJsonFieldValueException(dataType)

Review Comment:
   is it more efficient to return null directly in the field converter, to avoid exception creation and wrapping? I think the implementation should be simple, we return null only when the mode is PERMISSIVE and enablePartitionResult is true.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1309661841


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,7 +454,20 @@ class JacksonParser(
       schema.getFieldIndex(parser.getCurrentName) match {
         case Some(index) =>
           try {
-            row.update(index, fieldConverters(index).apply(parser))
+            val value = try {
+              fieldConverters(index).apply(parser)
+            } catch {
+              case PartialResultException(partialResult, e) if enablePartialResults =>

Review Comment:
   why do we swallow the exception earlier? what happens if we leave the exception to the outer `catch` block?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1704544738

   Shall we merge this or do you have any concerns or questions? I will be more than happy to answer them or follow up on the suggestions.
   
   We may also need to backport to Spark 3.5/3.4 where this change was originally introduced, I can open a separate PR for the backport.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311166208


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala:
##########
@@ -3450,6 +3450,174 @@ abstract class JsonSuite
     }
   }
 
+  test("SPARK-44940: fully parse the record except f1 if partial results are enabled") {

Review Comment:
   Not really, the PR also fixes a ClassCastException error, which should have been fixed when I added the original code for partial results parsing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1696834040

   I could only confirm this regression and the fix in the internal benchmarks, it could take quite some time to port the benchmark + data generation to open source. Is it possible to add it as a follow-up?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1315205770


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -589,12 +605,25 @@ class JacksonParser(
         throw BadRecordException(
           record = () => recordLiteral(record),
           partialResults = () => Array(row),
-          cause)
+          convertCauseForPartialResult(cause))
       case PartialResultArrayException(rows, cause) =>
         throw BadRecordException(
           record = () => recordLiteral(record),
           partialResults = () => rows,
           cause)
+      // These exceptions should never be thrown outside of JacksonParser.
+      // They are used for the control flow in the parser. We add them here for completeness
+      // since they also indicate a bad record.

Review Comment:
   Thank you for adding this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311167949


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,11 +454,17 @@ class JacksonParser(
       schema.getFieldIndex(parser.getCurrentName) match {
         case Some(index) =>
           try {
-            row.update(index, fieldConverters(index).apply(parser))
+            val value = fieldConverters(index).apply(parser)
+            row.update(index, value)
             skipRow = structFilters.skipRow(row, index)
             bitmask(index) = false
           } catch {
             case e: SparkUpgradeException => throw e
+            case err: PartialValueException if enablePartialResults =>
+              badRecordException = badRecordException.orElse(Some(err.cause))
+              row.update(index, err.partialResult)

Review Comment:
   just for my education: `err.partialResult` will be null if it's a simple column, or an array/struct/map with partial reuslt if it's array/struct/map type?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
URL: https://github.com/apache/spark/pull/42667


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1705718755

   Great, thanks. I will follow up on your comment regarding the possible regression in the parsing benchmark.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1315208847


##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
@@ -3,121 +3,125 @@ Benchmark for performance of JSON parsing
 ================================================================================================
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 JSON schema inferring:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        3720           3843         121          1.3         743.9       1.0X
-UTF-8 is set                                       5412           5455          45          0.9        1082.4       0.7X
+No encoding                                        2084           2134          46          2.4         416.8       1.0X
+UTF-8 is set                                       3077           3093          14          1.6         615.3       0.7X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 count a short column:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        3234           3254          33          1.5         646.7       1.0X
-UTF-8 is set                                       4847           4868          21          1.0         969.5       0.7X
+No encoding                                        2854           2863           8          1.8         570.8       1.0X
+UTF-8 is set                                       4066           4066           1          1.2         813.1       0.7X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 count a wide column:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        5702           5794         101          0.2        5702.1       1.0X
-UTF-8 is set                                       9526           9607          73          0.1        9526.1       0.6X
+No encoding                                        3348           3368          26          0.3        3347.8       1.0X
+UTF-8 is set                                       5215           5239          22          0.2        5214.7       0.6X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 select wide row:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                       18318          18448         199          0.0      366367.7       1.0X
-UTF-8 is set                                      19791          19887          99          0.0      395817.1       0.9X
+No encoding                                       11046          11102          54          0.0      220928.4       1.0X
+UTF-8 is set                                      12135          12181          54          0.0      242697.4       0.9X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 Select a subset of 10 columns:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 10 columns                                  2531           2570          51          0.4        2531.3       1.0X
-Select 1 column                                    1867           1882          16          0.5        1867.0       1.4X
+Select 10 columns                                  2486           2488           2          0.4        2486.5       1.0X
+Select 1 column                                    1505           1506           2          0.7        1504.6       1.7X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 creation of JSON parser per line:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Short column without encoding                       868            875           7          1.2         868.4       1.0X
-Short column with UTF-8                            1151           1163          11          0.9        1150.9       0.8X
-Wide column without encoding                      12063          12299         205          0.1       12063.0       0.1X
-Wide column with UTF-8                            16095          16136          51          0.1       16095.3       0.1X
+Short column without encoding                       888            889           3          1.1         887.6       1.0X
+Short column with UTF-8                            1134           1136           2          0.9        1134.3       0.8X
+Wide column without encoding                       8012           8056          51          0.1        8012.4       0.1X
+Wide column with UTF-8                             9830           9844          22          0.1        9829.7       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           165            170           4          6.1         164.7       1.0X
-from_json                                          2339           2386          77          0.4        2338.9       0.1X
-json_tuple                                         2667           2730          55          0.4        2667.3       0.1X
-get_json_object                                    2627           2659          32          0.4        2627.1       0.1X
+Text read                                            85             87           2         11.7          85.4       1.0X
+from_json                                          1706           1711           4          0.6        1706.4       0.1X
+json_tuple                                         1528           1534           7          0.7        1528.2       0.1X
+get_json_object                                    1275           1286          17          0.8        1275.0       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 Dataset of json strings:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           700            715          20          7.1         140.1       1.0X
-schema inferring                                   3144           3166          20          1.6         628.7       0.2X
-parsing                                            3261           3271           9          1.5         652.1       0.2X
+Text read                                           369            370           1         13.6          73.8       1.0X
+schema inferring                                   1880           1883           4          2.7         376.0       0.2X
+parsing                                            3731           3737           8          1.3         746.1       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                          1096           1105          12          4.6         219.1       1.0X
-Schema inferring                                   3818           3830          16          1.3         763.6       0.3X
-Parsing without charset                            4107           4137          32          1.2         821.4       0.3X
-Parsing with UTF-8                                 5717           5763          41          0.9        1143.3       0.2X
+Text read                                           553            579          32          9.0         110.6       1.0X
+Schema inferring                                   2195           2196           2          2.3         439.0       0.3X
+Parsing without charset                            4272           4274           3          1.2         854.3       0.1X

Review Comment:
   The overall ratio decreased because read text benchmark was 2x faster than before. Let me double check this. Now that you pointed it out, I am curious why the benchmark itself did not improve on a faster CPU when other cases did. If I confirm it is a regression, I will follow up on this in a separate PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311168839


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,11 +454,17 @@ class JacksonParser(
       schema.getFieldIndex(parser.getCurrentName) match {
         case Some(index) =>
           try {
-            row.update(index, fieldConverters(index).apply(parser))
+            val value = fieldConverters(index).apply(parser)
+            row.update(index, value)

Review Comment:
   Yes, I will update.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1310855079


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,7 +454,20 @@ class JacksonParser(
       schema.getFieldIndex(parser.getCurrentName) match {
         case Some(index) =>
           try {
-            row.update(index, fieldConverters(index).apply(parser))
+            val value = try {
+              fieldConverters(index).apply(parser)
+            } catch {
+              case PartialResultException(partialResult, e) if enablePartialResults =>

Review Comment:
   Because it is more convenient to do it this way. Yes, we could leave it in the outer catch block but then I would have to copy 
   
   ```scala
   row.update(index, value)
   skipRow = structFilters.skipRow(row, index)
   bitmask(index) = false
   ```
   for each case of PartialResultException.
   
   This is just a convenience thing to not duplicate the code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1306742597


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala:
##########
@@ -65,3 +93,25 @@ case class StringAsDataTypeException(
     fieldName: String,
     fieldValue: String,
     dataType: DataType) extends RuntimeException()
+
+/**
+ * No-stacktrace equivalent of `QueryExecutionErrors.cannotParseJSONFieldError`.
+ * Used for code control flow in the parser without overhead of creating a full exception.
+ */
+case class CannotParseJSONFieldException(
+    fieldName: String,
+    fieldValue: String,
+    jsonType: JsonToken,
+    dataType: DataType) extends RuntimeException() {
+  override def getStackTrace(): Array[StackTraceElement] = new Array[StackTraceElement](0)
+  override def fillInStackTrace(): Throwable = this
+}
+
+/**
+ * No-stacktrace equivalent of `QueryExecutionErrors.emptyJsonFieldValueError`.
+ * Used for code control flow in the parser without overhead of creating a full exception.

Review Comment:
   That is correct, we still use exceptions to report any issues in parsing.
   
   Throwing exceptions allows us to propagate errors across several functions. Without this, we would have to include checks in the code for every result. Also, we would have to capture the exception itself to be reported to the user. If we decided to go with returning some class to indicate failed parsing, then we would essentially be implementing exceptions. Exceptions are not that much slower, most of the time is spent on building the stacktrace which we only need when the very first exception is thrown.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1309386660


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -420,17 +420,17 @@ class JacksonParser(
     case VALUE_STRING if parser.getTextLength < 1 && allowEmptyString =>
       dataType match {
         case FloatType | DoubleType | TimestampType | DateType =>
-          throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+          throw EmptyJsonFieldValueException(dataType)
         case _ => null
       }
 
     case VALUE_STRING if parser.getTextLength < 1 =>
-      throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+      throw EmptyJsonFieldValueException(dataType)

Review Comment:
   I don't know if this is more efficient because you would have to check the parsing mode before returning a result or throwing an exception. Also, in permissive mode, we still want to capture the exception when badRecordsPath is enabled or _corrupt_record column is provided.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1700447290

   @MaxGekk @cloud-fan Can you review again? 
   I have added the benchmark that shows performance almost 3x performance improvement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1704586475

   I have opened backport PRs (linked in this PR).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1309661841


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,7 +454,20 @@ class JacksonParser(
       schema.getFieldIndex(parser.getCurrentName) match {
         case Some(index) =>
           try {
-            row.update(index, fieldConverters(index).apply(parser))
+            val value = try {
+              fieldConverters(index).apply(parser)
+            } catch {
+              case PartialResultException(partialResult, e) if enablePartialResults =>

Review Comment:
   why do we swallow the exception earlier? what happens if we leave the exception to the outer `catch` block? https://github.com/apache/spark/pull/42667/files#diff-ca601f0051c7c452605d0c5884167447a8673ec0c9095c347d5f67c09506e2ddR475



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1704548723

   Merged to master. It has some conflicts in branch-3.5 and 3.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311164222


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala:
##########
@@ -3450,6 +3450,174 @@ abstract class JsonSuite
     }
   }
 
+  test("SPARK-44940: fully parse the record except f1 if partial results are enabled") {

Review Comment:
   just to confirm: this PR is only for perf and we add more tests because the original PR that added partial result handling did not add enough tests?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311164568


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,11 +454,17 @@ class JacksonParser(
       schema.getFieldIndex(parser.getCurrentName) match {
         case Some(index) =>
           try {
-            row.update(index, fieldConverters(index).apply(parser))
+            val value = fieldConverters(index).apply(parser)
+            row.update(index, value)

Review Comment:
   unnecessary change now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1714375871

   Thank you for confirming that, @sadikovi . I agree with you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1305257429


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala:
##########
@@ -65,3 +93,25 @@ case class StringAsDataTypeException(
     fieldName: String,
     fieldValue: String,
     dataType: DataType) extends RuntimeException()
+
+/**
+ * No-stacktrace equivalent of `QueryExecutionErrors.cannotParseJSONFieldError`.
+ * Used for code control flow in the parser without overhead of creating a full exception.
+ */
+case class CannotParseJSONFieldException(
+    fieldName: String,
+    fieldValue: String,
+    jsonType: JsonToken,
+    dataType: DataType) extends RuntimeException() {
+  override def getStackTrace(): Array[StackTraceElement] = new Array[StackTraceElement](0)
+  override def fillInStackTrace(): Throwable = this
+}
+
+/**
+ * No-stacktrace equivalent of `QueryExecutionErrors.emptyJsonFieldValueError`.
+ * Used for code control flow in the parser without overhead of creating a full exception.

Review Comment:
   So we still use exceptions in the control flow, but use a trick to avoid the stacktrace overhead?
   
   Can we avoid using stacktrace in the control flow? or at lease not in the critical code path. e.g. `fieldConverters` should return null directly if the input is empty string and `enablePartialResults` is true.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1308154891


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -420,17 +420,17 @@ class JacksonParser(
     case VALUE_STRING if parser.getTextLength < 1 && allowEmptyString =>
       dataType match {
         case FloatType | DoubleType | TimestampType | DateType =>
-          throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+          throw EmptyJsonFieldValueException(dataType)
         case _ => null
       }
 
     case VALUE_STRING if parser.getTextLength < 1 =>
-      throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+      throw EmptyJsonFieldValueException(dataType)

Review Comment:
   Depending on the JSON object, `EmptyJsonFieldValueException` will be intercepted at convertObject, convertMap, or convertArray (because every JSON value will be treated as either of those three although I am not quite sure about from_json functions)and be wrapped into PartialResultException. This exception will eventually be converted into BadRecordException in `JacksonParser.parse` method and processed in `FailureSafeParser` according to the parsing mode.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -420,17 +420,17 @@ class JacksonParser(
     case VALUE_STRING if parser.getTextLength < 1 && allowEmptyString =>
       dataType match {
         case FloatType | DoubleType | TimestampType | DateType =>
-          throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+          throw EmptyJsonFieldValueException(dataType)
         case _ => null
       }
 
     case VALUE_STRING if parser.getTextLength < 1 =>
-      throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+      throw EmptyJsonFieldValueException(dataType)

Review Comment:
   I hope this answered your question.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1309658781


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -420,17 +420,17 @@ class JacksonParser(
     case VALUE_STRING if parser.getTextLength < 1 && allowEmptyString =>
       dataType match {
         case FloatType | DoubleType | TimestampType | DateType =>
-          throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+          throw EmptyJsonFieldValueException(dataType)
         case _ => null
       }
 
     case VALUE_STRING if parser.getTextLength < 1 =>
-      throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+      throw EmptyJsonFieldValueException(dataType)

Review Comment:
   oh this is a good point, we should still throw exceptions to not break other features.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311087439


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,7 +454,20 @@ class JacksonParser(
       schema.getFieldIndex(parser.getCurrentName) match {
         case Some(index) =>
           try {
-            row.update(index, fieldConverters(index).apply(parser))
+            val value = try {
+              fieldConverters(index).apply(parser)
+            } catch {
+              case PartialResultException(partialResult, e) if enablePartialResults =>

Review Comment:
   I have updated the code but it is now more difficult to read in my opinion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1713103692

   @dongjoon-hyun I reran the JSON benchmark and it seems like the previous results that I published were noisy. I confirmed there is no apparent regression in the patch.
   
   I ran only `Json files in the per-line mode` benchmark for 10 iterations. Results:
   
   Without the patch (latest master https://github.com/apache/spark/commit/eb0b09f0f2b518915421365a61d1f3d7d58b4404 with the patch reverted):
   ```
   [info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
   [info] Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
   [info] Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] Text read                                           463            476          15         10.8          92.6       1.0X
   [info] Schema inferring                                   2126           2166          48          2.4         425.1       0.2X
   [info] Parsing without charset                            3195           3201           4          1.6         638.9       0.1X
   [info] Parsing with UTF-8                                 4129           4140           8          1.2         825.8       0.1X
   ```
   
   With the patch (latest master):
   ```
   [info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
   [info] Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
   [info] Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] Text read                                           459            467           7         10.9          91.8       1.0X
   [info] Schema inferring                                   2159           2198          45          2.3         431.7       0.2X
   [info] Parsing without charset                            3106           3119          12          1.6         621.2       0.1X
   [info] Parsing with UTF-8                                 4071           4090          10          1.2         814.2       0.1X
   ```
   
   It seems the results are approximately the same as before. However, the benchmark results tend to fluctuate quite a lot. For example, when I reran the same benchmark without any code changes: 
   
   
   The second run with the patch:
   ```
   [info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
   [info] Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
   [info] Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] Text read                                           458            469           9         10.9          91.5       1.0X
   [info] Schema inferring                                   2147           2184          48          2.3         429.4       0.2X
   [info] Parsing without charset                            3294           3308          10          1.5         658.8       0.1X
   [info] Parsing with UTF-8                                 4437           4444           8          1.1         887.4       0.1X
   ```
   
   I think it is fine and it is just noise in the benchmark, no apparent regression because of the patch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311038797


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,7 +454,20 @@ class JacksonParser(
       schema.getFieldIndex(parser.getCurrentName) match {
         case Some(index) =>
           try {
-            row.update(index, fieldConverters(index).apply(parser))
+            val value = try {
+              fieldConverters(index).apply(parser)
+            } catch {
+              case PartialResultException(partialResult, e) if enablePartialResults =>

Review Comment:
   Can we have a common trait of `PartialResultException`, `PartialArrayDataResultException` and `PartialMapDataResultException`, so that we don't need to duplicate code?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311171288


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,11 +454,17 @@ class JacksonParser(
       schema.getFieldIndex(parser.getCurrentName) match {
         case Some(index) =>
           try {
-            row.update(index, fieldConverters(index).apply(parser))
+            val value = fieldConverters(index).apply(parser)
+            row.update(index, value)
             skipRow = structFilters.skipRow(row, index)
             bitmask(index) = false
           } catch {
             case e: SparkUpgradeException => throw e
+            case err: PartialValueException if enablePartialResults =>
+              badRecordException = badRecordException.orElse(Some(err.cause))
+              row.update(index, err.partialResult)

Review Comment:
   Actually, not. When we encounter PartialValueException/`err.partialResult`, it is already a complex value, e.g. struct/map/array. If there was an error parsing a primitive value, it would be caught in NonFatal clause and then rethrown as PartialValueException.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311066933


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,7 +454,20 @@ class JacksonParser(
       schema.getFieldIndex(parser.getCurrentName) match {
         case Some(index) =>
           try {
-            row.update(index, fieldConverters(index).apply(parser))
+            val value = try {
+              fieldConverters(index).apply(parser)
+            } catch {
+              case PartialResultException(partialResult, e) if enablePartialResults =>

Review Comment:
   This is a good idea, let me make the change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1701995162

   cc @HyukjinKwon for review as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1306867310


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -420,17 +420,17 @@ class JacksonParser(
     case VALUE_STRING if parser.getTextLength < 1 && allowEmptyString =>
       dataType match {
         case FloatType | DoubleType | TimestampType | DateType =>
-          throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+          throw EmptyJsonFieldValueException(dataType)
         case _ => null
       }
 
     case VALUE_STRING if parser.getTextLength < 1 =>
-      throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+      throw EmptyJsonFieldValueException(dataType)

Review Comment:
   the control flow here is bit convoluted, when/where is `EmptyJsonFieldValueException` being handled first?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org