Posted to reviews@spark.apache.org by "sadikovi (via GitHub)" <gi...@apache.org> on 2023/08/25 02:26:23 UTC
[GitHub] [spark] sadikovi opened a new pull request, #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
sadikovi opened a new pull request, #42667:
URL: https://github.com/apache/spark/pull/42667
### What changes were proposed in this pull request?
The PR improves JSON parsing when `spark.sql.json.enablePartialResults` is enabled:
- Fixes a `ClassCastException` when using nested arrays: `org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow`.
- Improves parsing of nested struct fields, e.g. `{"a1": "AAA", "a2": [{"f1": "", "f2": ""}], "a3": "id1", "a4": "XXX"}` used to be parsed as `|AAA|NULL |NULL|NULL|` and is now parsed as `|AAA|[{NULL, }]|id1|XXX|`.
- Improves performance of nested JSON parsing. The initial implementation threw too many exceptions when multiple nested fields failed to parse. When the config is disabled, this is not a problem because the entire record is marked as NULL.
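The behavior change described above can be modeled with a small, dependency-free sketch. This is an illustration only, written in Python for brevity (Spark's actual parser is the Scala `JacksonParser`), and the names `FieldError` and `parse_row` are hypothetical: with partial results enabled, only the failing field becomes NULL, while without it the entire record is nulled.

```python
class FieldError(Exception):
    """Raised by a field converter when a value cannot be parsed."""


def parse_row(raw, converters, partial_results):
    """Apply one converter per field (toy model of the parser's control flow).

    With partial_results enabled, only the failing field becomes None and the
    rest of the row survives; otherwise the whole record is nulled, mirroring
    PERMISSIVE-mode behavior without partial results.
    """
    row = {}
    for name, conv in converters.items():
        try:
            row[name] = conv(raw[name])
        except FieldError:
            if not partial_results:
                # Null the entire record on the first bad field.
                return {name: None for name in converters}
            row[name] = None  # keep the other fields
    return row


def require_nonempty(s):
    """Example converter: an empty string is a per-field parse error."""
    if not s:
        raise FieldError("empty value")
    return s
```

For instance, with converters `{"a1": require_nonempty, "a2": require_nonempty, "a4": require_nonempty}` and input `{"a1": "AAA", "a2": "", "a4": "XXX"}`, enabling partial results keeps `a1` and `a4` and nulls only `a2`.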
### Why are the changes needed?
Fixes some corner cases in JSON parsing and improves performance when `spark.sql.json.enablePartialResults` is enabled.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I added tests to verify that nested structs, maps, and arrays can be parsed without affecting the subsequent fields in the JSON. I also updated the existing tests for the case when `spark.sql.json.enablePartialResults` is enabled, because we parse more data now.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1315206808
##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
@@ -3,121 +3,125 @@ Benchmark for performance of JSON parsing
================================================================================================
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
JSON schema inferring: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 3720 3843 121 1.3 743.9 1.0X
-UTF-8 is set 5412 5455 45 0.9 1082.4 0.7X
+No encoding 2084 2134 46 2.4 416.8 1.0X
+UTF-8 is set 3077 3093 14 1.6 615.3 0.7X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
count a short column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 3234 3254 33 1.5 646.7 1.0X
-UTF-8 is set 4847 4868 21 1.0 969.5 0.7X
+No encoding 2854 2863 8 1.8 570.8 1.0X
+UTF-8 is set 4066 4066 1 1.2 813.1 0.7X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
count a wide column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 5702 5794 101 0.2 5702.1 1.0X
-UTF-8 is set 9526 9607 73 0.1 9526.1 0.6X
+No encoding 3348 3368 26 0.3 3347.8 1.0X
+UTF-8 is set 5215 5239 22 0.2 5214.7 0.6X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
select wide row: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 18318 18448 199 0.0 366367.7 1.0X
-UTF-8 is set 19791 19887 99 0.0 395817.1 0.9X
+No encoding 11046 11102 54 0.0 220928.4 1.0X
+UTF-8 is set 12135 12181 54 0.0 242697.4 0.9X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Select a subset of 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Select 10 columns 2531 2570 51 0.4 2531.3 1.0X
-Select 1 column 1867 1882 16 0.5 1867.0 1.4X
+Select 10 columns 2486 2488 2 0.4 2486.5 1.0X
+Select 1 column 1505 1506 2 0.7 1504.6 1.7X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
creation of JSON parser per line: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Short column without encoding 868 875 7 1.2 868.4 1.0X
-Short column with UTF-8 1151 1163 11 0.9 1150.9 0.8X
-Wide column without encoding 12063 12299 205 0.1 12063.0 0.1X
-Wide column with UTF-8 16095 16136 51 0.1 16095.3 0.1X
+Short column without encoding 888 889 3 1.1 887.6 1.0X
+Short column with UTF-8 1134 1136 2 0.9 1134.3 0.8X
+Wide column without encoding 8012 8056 51 0.1 8012.4 0.1X
+Wide column with UTF-8 9830 9844 22 0.1 9829.7 0.1X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 165 170 4 6.1 164.7 1.0X
-from_json 2339 2386 77 0.4 2338.9 0.1X
-json_tuple 2667 2730 55 0.4 2667.3 0.1X
-get_json_object 2627 2659 32 0.4 2627.1 0.1X
+Text read 85 87 2 11.7 85.4 1.0X
+from_json 1706 1711 4 0.6 1706.4 0.1X
+json_tuple 1528 1534 7 0.7 1528.2 0.1X
+get_json_object 1275 1286 17 0.8 1275.0 0.1X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Dataset of json strings: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 700 715 20 7.1 140.1 1.0X
-schema inferring 3144 3166 20 1.6 628.7 0.2X
-parsing 3261 3271 9 1.5 652.1 0.2X
+Text read 369 370 1 13.6 73.8 1.0X
+schema inferring 1880 1883 4 2.7 376.0 0.2X
+parsing 3731 3737 8 1.3 746.1 0.1X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Json files in the per-line mode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 1096 1105 12 4.6 219.1 1.0X
-Schema inferring 3818 3830 16 1.3 763.6 0.3X
-Parsing without charset 4107 4137 32 1.2 821.4 0.3X
-Parsing with UTF-8 5717 5763 41 0.9 1143.3 0.2X
+Text read 553 579 32 9.0 110.6 1.0X
+Schema inferring 2195 2196 2 2.3 439.0 0.3X
+Parsing without charset 4272 4274 3 1.2 854.3 0.1X
Review Comment:
Given the ratio between `Text read` and `Parsing without charset`, is there a chance of regression in this PR?
[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1308683062
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -420,17 +420,17 @@ class JacksonParser(
case VALUE_STRING if parser.getTextLength < 1 && allowEmptyString =>
dataType match {
case FloatType | DoubleType | TimestampType | DateType =>
- throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+ throw EmptyJsonFieldValueException(dataType)
case _ => null
}
case VALUE_STRING if parser.getTextLength < 1 =>
- throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+ throw EmptyJsonFieldValueException(dataType)
Review Comment:
Is it more efficient to return null directly in the field converter, to avoid exception creation and wrapping? I think the implementation should be simple: we return null only when the mode is PERMISSIVE and `enablePartialResults` is true.
[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1309661841
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,7 +454,20 @@ class JacksonParser(
schema.getFieldIndex(parser.getCurrentName) match {
case Some(index) =>
try {
- row.update(index, fieldConverters(index).apply(parser))
+ val value = try {
+ fieldConverters(index).apply(parser)
+ } catch {
+ case PartialResultException(partialResult, e) if enablePartialResults =>
Review Comment:
why do we swallow the exception earlier? what happens if we leave the exception to the outer `catch` block?
[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1704544738
Shall we merge this or do you have any concerns or questions? I will be more than happy to answer them or follow up on the suggestions.
We may also need to backport this to Spark 3.5/3.4, where this change was originally introduced; I can open a separate PR for the backport.
[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311166208
##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala:
##########
@@ -3450,6 +3450,174 @@ abstract class JsonSuite
}
}
+ test("SPARK-44940: fully parse the record except f1 if partial results are enabled") {
Review Comment:
Not really, the PR also fixes a `ClassCastException`, which should have been fixed when I added the original code for partial results parsing.
[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1696834040
I could only confirm this regression and the fix in the internal benchmarks; it could take quite some time to port the benchmark and data generation to open source. Is it possible to add it as a follow-up?
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1315205770
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -589,12 +605,25 @@ class JacksonParser(
throw BadRecordException(
record = () => recordLiteral(record),
partialResults = () => Array(row),
- cause)
+ convertCauseForPartialResult(cause))
case PartialResultArrayException(rows, cause) =>
throw BadRecordException(
record = () => recordLiteral(record),
partialResults = () => rows,
cause)
+ // These exceptions should never be thrown outside of JacksonParser.
+ // They are used for the control flow in the parser. We add them here for completeness
+ // since they also indicate a bad record.
Review Comment:
Thank you for adding this.
[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311167949
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,11 +454,17 @@ class JacksonParser(
schema.getFieldIndex(parser.getCurrentName) match {
case Some(index) =>
try {
- row.update(index, fieldConverters(index).apply(parser))
+ val value = fieldConverters(index).apply(parser)
+ row.update(index, value)
skipRow = structFilters.skipRow(row, index)
bitmask(index) = false
} catch {
case e: SparkUpgradeException => throw e
+ case err: PartialValueException if enablePartialResults =>
+ badRecordException = badRecordException.orElse(Some(err.cause))
+ row.update(index, err.partialResult)
Review Comment:
Just for my education: `err.partialResult` will be null if it's a simple column, or an array/struct/map with a partial result if it's an array/struct/map type?
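The distinction the question asks about can be illustrated with a toy model (Python rather than the parser's Scala, with hypothetical names `PartialValueError`, `convert_int`, `convert_int_array`; this is a sketch of the idea, not Spark's code): the carried partial result is null for a primitive field, but a partially-filled container for array/struct/map fields, so the sibling entries survive.

```python
class PartialValueError(Exception):
    """Carries whatever was successfully parsed before the failure."""

    def __init__(self, partial_result, cause):
        super().__init__(str(cause))
        self.partial_result = partial_result
        self.cause = cause


def convert_int(text):
    """Simple column: there is nothing partial to keep, so carry None."""
    try:
        return int(text)
    except ValueError as e:
        raise PartialValueError(None, e)


def convert_int_array(texts):
    """Container: keep the good entries, null the bad one, carry the result."""
    out, first_error = [], None
    for t in texts:
        try:
            out.append(int(t))
        except ValueError as e:
            if first_error is None:
                first_error = e
            out.append(None)
    if first_error is not None:
        raise PartialValueError(out, first_error)
    return out
```

So `convert_int("x")` carries `None`, while `convert_int_array(["1", "x", "3"])` carries `[1, None, 3]`.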
[GitHub] [spark] HyukjinKwon closed pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
URL: https://github.com/apache/spark/pull/42667
[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1705718755
Great, thanks. I will follow up on your comment regarding the possible regression in the parsing benchmark.
[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1315208847
##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
Json files in the per-line mode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 1096 1105 12 4.6 219.1 1.0X
-Schema inferring 3818 3830 16 1.3 763.6 0.3X
-Parsing without charset 4107 4137 32 1.2 821.4 0.3X
-Parsing with UTF-8 5717 5763 41 0.9 1143.3 0.2X
+Text read 553 579 32 9.0 110.6 1.0X
+Schema inferring 2195 2196 2 2.3 439.0 0.3X
+Parsing without charset 4272 4274 3 1.2 854.3 0.1X
Review Comment:
The overall ratio decreased because the text read benchmark was 2x faster than before. Let me double-check this. Now that you pointed it out, I am curious why this benchmark did not improve on a faster CPU when the other cases did. If I confirm it is a regression, I will follow up in a separate PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311168839
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,11 +454,17 @@ class JacksonParser(
schema.getFieldIndex(parser.getCurrentName) match {
case Some(index) =>
try {
- row.update(index, fieldConverters(index).apply(parser))
+ val value = fieldConverters(index).apply(parser)
+ row.update(index, value)
Review Comment:
Yes, I will update.
[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1310855079
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,7 +454,20 @@ class JacksonParser(
schema.getFieldIndex(parser.getCurrentName) match {
case Some(index) =>
try {
- row.update(index, fieldConverters(index).apply(parser))
+ val value = try {
+ fieldConverters(index).apply(parser)
+ } catch {
+ case PartialResultException(partialResult, e) if enablePartialResults =>
Review Comment:
Because it is more convenient to do it this way. Yes, we could leave it to the outer catch block, but then I would have to copy
```scala
row.update(index, value)
skipRow = structFilters.skipRow(row, index)
bitmask(index) = false
```
for each case of `PartialResultException`. This is just a convenience to avoid duplicating the code.
[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1306742597
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala:
##########
@@ -65,3 +93,25 @@ case class StringAsDataTypeException(
fieldName: String,
fieldValue: String,
dataType: DataType) extends RuntimeException()
+
+/**
+ * No-stacktrace equivalent of `QueryExecutionErrors.cannotParseJSONFieldError`.
+ * Used for code control flow in the parser without overhead of creating a full exception.
+ */
+case class CannotParseJSONFieldException(
+ fieldName: String,
+ fieldValue: String,
+ jsonType: JsonToken,
+ dataType: DataType) extends RuntimeException() {
+ override def getStackTrace(): Array[StackTraceElement] = new Array[StackTraceElement](0)
+ override def fillInStackTrace(): Throwable = this
+}
+
+/**
+ * No-stacktrace equivalent of `QueryExecutionErrors.emptyJsonFieldValueError`.
+ * Used for code control flow in the parser without overhead of creating a full exception.
Review Comment:
That is correct, we still use exceptions to report any parsing issues.
Throwing exceptions lets us propagate errors across several functions. Without them, we would have to add checks for every result in the code, and we would also have to capture the exception itself so it can be reported to the user. If we went with returning some class to indicate failed parsing, we would essentially be reimplementing exceptions. Exceptions are not that much slower; most of the cost is in building the stacktrace, which we only need for the very first exception thrown.
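The trade-off described above can be distilled into a small runnable sketch. This is illustrative only; the class and field names below are assumptions and not Spark's actual API: the point is that overriding `fillInStackTrace` skips the JVM's stack capture, which dominates the cost of constructing an ordinary exception, so the exception stays cheap enough for control flow.

```scala
// Illustrative sketch only (names are assumptions, not Spark's classes): an
// exception used purely for control flow that skips stack-trace capture.
case class ControlFlowParseError(fieldName: String) extends RuntimeException {
  // Returning `this` avoids the expensive native fillInStackTrace call.
  override def fillInStackTrace(): Throwable = this
  override def getStackTrace: Array[StackTraceElement] = Array.empty
}

object ControlFlowDemo {
  // Parse a decimal field, signalling failure via the cheap exception.
  def parseField(name: String, raw: String): Double =
    try raw.toDouble
    catch { case _: NumberFormatException => throw ControlFlowParseError(name) }

  def main(args: Array[String]): Unit = {
    assert(parseField("a", "1.5") == 1.5)
    val err: ControlFlowParseError =
      try { parseField("b", "oops"); sys.error("expected failure") }
      catch { case e: ControlFlowParseError => e }
    assert(err.fieldName == "b")
    assert(err.getStackTrace.isEmpty) // no stack frames were recorded
  }
}
```

A caller can still catch the error and inspect `fieldName`, without paying for stack frames that would only be needed when reporting the very first failure to the user.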
[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1309386660
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -420,17 +420,17 @@ class JacksonParser(
case VALUE_STRING if parser.getTextLength < 1 && allowEmptyString =>
dataType match {
case FloatType | DoubleType | TimestampType | DateType =>
- throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+ throw EmptyJsonFieldValueException(dataType)
case _ => null
}
case VALUE_STRING if parser.getTextLength < 1 =>
- throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+ throw EmptyJsonFieldValueException(dataType)
Review Comment:
I don't know if this would be more efficient: you would have to check the parsing mode before deciding whether to return a result or throw an exception. Also, in permissive mode, we still want to capture the exception when `badRecordsPath` is enabled or a `_corrupt_record` column is provided.
[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1700447290
@MaxGekk @cloud-fan Can you review again?
I have added a benchmark that shows an almost 3x performance improvement.
[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1704586475
I have opened backport PRs (linked in this PR).
[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1309661841
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,7 +454,20 @@ class JacksonParser(
schema.getFieldIndex(parser.getCurrentName) match {
case Some(index) =>
try {
- row.update(index, fieldConverters(index).apply(parser))
+ val value = try {
+ fieldConverters(index).apply(parser)
+ } catch {
+ case PartialResultException(partialResult, e) if enablePartialResults =>
Review Comment:
why do we swallow the exception earlier? what happens if we leave the exception to the outer `catch` block? https://github.com/apache/spark/pull/42667/files#diff-ca601f0051c7c452605d0c5884167447a8673ec0c9095c347d5f67c09506e2ddR475
[GitHub] [spark] HyukjinKwon commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1704548723
Merged to master. It has some conflicts in branch-3.5 and 3.4.
[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311164222
##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala:
##########
@@ -3450,6 +3450,174 @@ abstract class JsonSuite
}
}
+ test("SPARK-44940: fully parse the record except f1 if partial results are enabled") {
Review Comment:
just to confirm: this PR is only for perf and we add more tests because the original PR that added partial result handling did not add enough tests?
[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311164568
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,11 +454,17 @@ class JacksonParser(
schema.getFieldIndex(parser.getCurrentName) match {
case Some(index) =>
try {
- row.update(index, fieldConverters(index).apply(parser))
+ val value = fieldConverters(index).apply(parser)
+ row.update(index, value)
Review Comment:
unnecessary change now.
[GitHub] [spark] dongjoon-hyun commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1714375871
Thank you for confirming that, @sadikovi . I agree with you.
[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1305257429
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala:
##########
@@ -65,3 +93,25 @@ case class StringAsDataTypeException(
fieldName: String,
fieldValue: String,
dataType: DataType) extends RuntimeException()
+
+/**
+ * No-stacktrace equivalent of `QueryExecutionErrors.cannotParseJSONFieldError`.
+ * Used for code control flow in the parser without overhead of creating a full exception.
+ */
+case class CannotParseJSONFieldException(
+ fieldName: String,
+ fieldValue: String,
+ jsonType: JsonToken,
+ dataType: DataType) extends RuntimeException() {
+ override def getStackTrace(): Array[StackTraceElement] = new Array[StackTraceElement](0)
+ override def fillInStackTrace(): Throwable = this
+}
+
+/**
+ * No-stacktrace equivalent of `QueryExecutionErrors.emptyJsonFieldValueError`.
+ * Used for code control flow in the parser without overhead of creating a full exception.
Review Comment:
So we still use exceptions in the control flow, but use a trick to avoid the stacktrace overhead?
Can we avoid using exceptions in the control flow, or at least not in the critical code path? E.g. `fieldConverters` could return null directly if the input is an empty string and `enablePartialResults` is true.
[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1308154891
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -420,17 +420,17 @@ class JacksonParser(
case VALUE_STRING if parser.getTextLength < 1 && allowEmptyString =>
dataType match {
case FloatType | DoubleType | TimestampType | DateType =>
- throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+ throw EmptyJsonFieldValueException(dataType)
case _ => null
}
case VALUE_STRING if parser.getTextLength < 1 =>
- throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+ throw EmptyJsonFieldValueException(dataType)
Review Comment:
Depending on the JSON object, `EmptyJsonFieldValueException` will be intercepted in `convertObject`, `convertMap`, or `convertArray` (every JSON value is treated as one of those three, although I am not quite sure about the `from_json` functions) and wrapped into `PartialResultException`. That exception is eventually converted into `BadRecordException` in the `JacksonParser.parse` method and processed by `FailureSafeParser` according to the parsing mode.
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -420,17 +420,17 @@ class JacksonParser(
case VALUE_STRING if parser.getTextLength < 1 && allowEmptyString =>
dataType match {
case FloatType | DoubleType | TimestampType | DateType =>
- throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+ throw EmptyJsonFieldValueException(dataType)
case _ => null
}
case VALUE_STRING if parser.getTextLength < 1 =>
- throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+ throw EmptyJsonFieldValueException(dataType)
Review Comment:
I hope this answered your question.
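The propagation chain described in this thread can be sketched end-to-end. The classes below are simplified, hypothetical stand-ins (the real Spark exceptions carry more fields and different signatures): a field-level failure is caught at the object level, attached to the partial row, and finally resurfaced as a bad-record error, which the mode-aware parser would then handle.

```scala
// Simplified stand-ins for the real exceptions (names/fields are assumptions):
case class EmptyFieldError(dataType: String) extends RuntimeException
case class PartialResult(partialRow: Seq[Any], cause: Throwable)
  extends RuntimeException(cause)
case class BadRecord(record: String, cause: Throwable) extends RuntimeException(cause)

object ChainDemo {
  // Field converter: empty input is an error for numeric types.
  def convertField(raw: String): Any =
    if (raw.isEmpty) throw EmptyFieldError("DoubleType") else raw.toDouble

  // Object converter: keeps the partial row and remembers the first error.
  def convertObject(fields: Seq[String]): Seq[Any] = {
    val row = new Array[Any](fields.length)
    var firstError: Option[Throwable] = None
    fields.zipWithIndex.foreach { case (raw, i) =>
      try row(i) = convertField(raw)
      catch { case e: EmptyFieldError => firstError = firstError.orElse(Some(e)) }
    }
    firstError.foreach(e => throw PartialResult(row.toSeq, e))
    row.toSeq
  }

  // Top-level parse: rewraps the partial result as a bad-record error, much
  // like JacksonParser.parse does before FailureSafeParser applies the mode.
  def parse(record: Seq[String]): Seq[Any] =
    try convertObject(record)
    catch { case PartialResult(_, cause) => throw BadRecord(record.mkString(","), cause) }
}
```

In permissive mode the mode-aware layer would keep the partial row and record the cause; in failfast mode it would rethrow.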
[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1309658781
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -420,17 +420,17 @@ class JacksonParser(
case VALUE_STRING if parser.getTextLength < 1 && allowEmptyString =>
dataType match {
case FloatType | DoubleType | TimestampType | DateType =>
- throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+ throw EmptyJsonFieldValueException(dataType)
case _ => null
}
case VALUE_STRING if parser.getTextLength < 1 =>
- throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+ throw EmptyJsonFieldValueException(dataType)
Review Comment:
oh this is a good point, we should still throw exceptions to not break other features.
[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311087439
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,7 +454,20 @@ class JacksonParser(
schema.getFieldIndex(parser.getCurrentName) match {
case Some(index) =>
try {
- row.update(index, fieldConverters(index).apply(parser))
+ val value = try {
+ fieldConverters(index).apply(parser)
+ } catch {
+ case PartialResultException(partialResult, e) if enablePartialResults =>
Review Comment:
I have updated the code but it is now more difficult to read in my opinion.
[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1713103692
@dongjoon-hyun I reran the JSON benchmark and it seems like the previous results that I published were noisy. I confirmed there is no apparent regression in the patch.
I ran only `Json files in the per-line mode` benchmark for 10 iterations. Results:
Without the patch (latest master https://github.com/apache/spark/commit/eb0b09f0f2b518915421365a61d1f3d7d58b4404 with the patch reverted):
```
[info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
[info] Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
[info] Json files in the per-line mode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Text read 463 476 15 10.8 92.6 1.0X
[info] Schema inferring 2126 2166 48 2.4 425.1 0.2X
[info] Parsing without charset 3195 3201 4 1.6 638.9 0.1X
[info] Parsing with UTF-8 4129 4140 8 1.2 825.8 0.1X
```
With the patch (latest master):
```
[info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
[info] Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
[info] Json files in the per-line mode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Text read 459 467 7 10.9 91.8 1.0X
[info] Schema inferring 2159 2198 45 2.3 431.7 0.2X
[info] Parsing without charset 3106 3119 12 1.6 621.2 0.1X
[info] Parsing with UTF-8 4071 4090 10 1.2 814.2 0.1X
```
It seems the results are approximately the same as before. However, the benchmark results tend to fluctuate quite a lot. For example, here is a second run with the patch, without any code changes:
```
[info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
[info] Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
[info] Json files in the per-line mode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Text read 458 469 9 10.9 91.5 1.0X
[info] Schema inferring 2147 2184 48 2.3 429.4 0.2X
[info] Parsing without charset 3294 3308 10 1.5 658.8 0.1X
[info] Parsing with UTF-8 4437 4444 8 1.1 887.4 0.1X
```
I think this is fine; it is just noise in the benchmark, with no apparent regression caused by the patch.
[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311038797
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,7 +454,20 @@ class JacksonParser(
schema.getFieldIndex(parser.getCurrentName) match {
case Some(index) =>
try {
- row.update(index, fieldConverters(index).apply(parser))
+ val value = try {
+ fieldConverters(index).apply(parser)
+ } catch {
+ case PartialResultException(partialResult, e) if enablePartialResults =>
Review Comment:
Can we have a common trait of `PartialResultException`, `PartialArrayDataResultException` and `PartialMapDataResultException`, so that we don't need to duplicate code?
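The suggestion could look roughly like this. This is a hedged sketch: all names and signatures below are assumptions standing in for `PartialResultException`, `PartialArrayDataResultException`, and `PartialMapDataResultException`, not the final Spark change. A shared trait exposes the partial result and the underlying cause, so a single `catch` clause replaces three near-identical ones.

```scala
// Hypothetical simplification of the reviewer's suggestion (names assumed):
trait PartialValueError extends Throwable {
  def partialResult: Any   // the partially converted struct/array/map
  def cause: Throwable     // the original parsing failure
}

case class PartialStructError(partialResult: Any, cause: Throwable)
  extends RuntimeException(cause) with PartialValueError
case class PartialArrayError(partialResult: Any, cause: Throwable)
  extends RuntimeException(cause) with PartialValueError
case class PartialMapError(partialResult: Any, cause: Throwable)
  extends RuntimeException(cause) with PartialValueError

object TraitDemo {
  // Runs `body`; on any partial-result failure, returns the partial value
  // together with the underlying cause — one clause for all three flavours.
  def recover(body: => Any): (Any, Option[Throwable]) =
    try (body, None)
    catch { case e: PartialValueError => (e.partialResult, Some(e.cause)) }
}
```

The merged PR ended up with a similar shape: the later diff in this thread catches a single `PartialValueException` instead of matching each case class.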
[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311171288
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,11 +454,17 @@ class JacksonParser(
schema.getFieldIndex(parser.getCurrentName) match {
case Some(index) =>
try {
- row.update(index, fieldConverters(index).apply(parser))
+ val value = fieldConverters(index).apply(parser)
+ row.update(index, value)
skipRow = structFilters.skipRow(row, index)
bitmask(index) = false
} catch {
case e: SparkUpgradeException => throw e
+ case err: PartialValueException if enablePartialResults =>
+ badRecordException = badRecordException.orElse(Some(err.cause))
+ row.update(index, err.partialResult)
Review Comment:
Actually, no. When we encounter `PartialValueException`, `err.partialResult` is already a complex value, e.g. a struct/map/array. If there were an error parsing a primitive value, it would be caught in the `NonFatal` clause and then rethrown as `PartialValueException`.
[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1311066933
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -454,7 +454,20 @@ class JacksonParser(
schema.getFieldIndex(parser.getCurrentName) match {
case Some(index) =>
try {
- row.update(index, fieldConverters(index).apply(parser))
+ val value = try {
+ fieldConverters(index).apply(parser)
+ } catch {
+ case PartialResultException(partialResult, e) if enablePartialResults =>
Review Comment:
This is a good idea, let me make the change.
[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "sadikovi (via GitHub)" <gi...@apache.org>.
sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1701995162
cc @HyukjinKwon for review as well.
[GitHub] [spark] cloud-fan commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1306867310
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -420,17 +420,17 @@ class JacksonParser(
case VALUE_STRING if parser.getTextLength < 1 && allowEmptyString =>
dataType match {
case FloatType | DoubleType | TimestampType | DateType =>
- throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+ throw EmptyJsonFieldValueException(dataType)
case _ => null
}
case VALUE_STRING if parser.getTextLength < 1 =>
- throw QueryExecutionErrors.emptyJsonFieldValueError(dataType)
+ throw EmptyJsonFieldValueException(dataType)
Review Comment:
the control flow here is a bit convoluted, when/where is `EmptyJsonFieldValueException` handled first?