Posted to reviews@spark.apache.org by "sadikovi (via GitHub)" <gi...@apache.org> on 2023/09/04 21:42:18 UTC
[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1315208847
##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
@@ -3,121 +3,125 @@ Benchmark for performance of JSON parsing
================================================================================================
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
JSON schema inferring: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 3720 3843 121 1.3 743.9 1.0X
-UTF-8 is set 5412 5455 45 0.9 1082.4 0.7X
+No encoding 2084 2134 46 2.4 416.8 1.0X
+UTF-8 is set 3077 3093 14 1.6 615.3 0.7X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
count a short column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 3234 3254 33 1.5 646.7 1.0X
-UTF-8 is set 4847 4868 21 1.0 969.5 0.7X
+No encoding 2854 2863 8 1.8 570.8 1.0X
+UTF-8 is set 4066 4066 1 1.2 813.1 0.7X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
count a wide column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 5702 5794 101 0.2 5702.1 1.0X
-UTF-8 is set 9526 9607 73 0.1 9526.1 0.6X
+No encoding 3348 3368 26 0.3 3347.8 1.0X
+UTF-8 is set 5215 5239 22 0.2 5214.7 0.6X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
select wide row: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 18318 18448 199 0.0 366367.7 1.0X
-UTF-8 is set 19791 19887 99 0.0 395817.1 0.9X
+No encoding 11046 11102 54 0.0 220928.4 1.0X
+UTF-8 is set 12135 12181 54 0.0 242697.4 0.9X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Select a subset of 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Select 10 columns 2531 2570 51 0.4 2531.3 1.0X
-Select 1 column 1867 1882 16 0.5 1867.0 1.4X
+Select 10 columns 2486 2488 2 0.4 2486.5 1.0X
+Select 1 column 1505 1506 2 0.7 1504.6 1.7X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
creation of JSON parser per line: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Short column without encoding 868 875 7 1.2 868.4 1.0X
-Short column with UTF-8 1151 1163 11 0.9 1150.9 0.8X
-Wide column without encoding 12063 12299 205 0.1 12063.0 0.1X
-Wide column with UTF-8 16095 16136 51 0.1 16095.3 0.1X
+Short column without encoding 888 889 3 1.1 887.6 1.0X
+Short column with UTF-8 1134 1136 2 0.9 1134.3 0.8X
+Wide column without encoding 8012 8056 51 0.1 8012.4 0.1X
+Wide column with UTF-8 9830 9844 22 0.1 9829.7 0.1X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 165 170 4 6.1 164.7 1.0X
-from_json 2339 2386 77 0.4 2338.9 0.1X
-json_tuple 2667 2730 55 0.4 2667.3 0.1X
-get_json_object 2627 2659 32 0.4 2627.1 0.1X
+Text read 85 87 2 11.7 85.4 1.0X
+from_json 1706 1711 4 0.6 1706.4 0.1X
+json_tuple 1528 1534 7 0.7 1528.2 0.1X
+get_json_object 1275 1286 17 0.8 1275.0 0.1X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Dataset of json strings: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 700 715 20 7.1 140.1 1.0X
-schema inferring 3144 3166 20 1.6 628.7 0.2X
-parsing 3261 3271 9 1.5 652.1 0.2X
+Text read 369 370 1 13.6 73.8 1.0X
+schema inferring 1880 1883 4 2.7 376.0 0.2X
+parsing 3731 3737 8 1.3 746.1 0.1X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Json files in the per-line mode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 1096 1105 12 4.6 219.1 1.0X
-Schema inferring 3818 3830 16 1.3 763.6 0.3X
-Parsing without charset 4107 4137 32 1.2 821.4 0.3X
-Parsing with UTF-8 5717 5763 41 0.9 1143.3 0.2X
+Text read 553 579 32 9.0 110.6 1.0X
+Schema inferring 2195 2196 2 2.3 439.0 0.3X
+Parsing without charset 4272 4274 3 1.2 854.3 0.1X
Review Comment:
The relative ratio dropped because the "Text read" baseline is about 2x faster than before (553 ms vs. 1096 ms best time), so every case measured against it looks slower. Let me double-check this. Now that you have pointed it out, I am curious why the parsing benchmark itself did not improve on a faster CPU (4272 ms vs. 4107 ms best time) when the other cases did. If I confirm it is a regression, I will follow up in a separate PR.
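
To make the arithmetic concrete, here is a rough sketch in Python. It assumes the Relative column is the baseline's best time divided by the case's best time, which is consistent with the figures in this diff but is not taken from the benchmark code itself. The helper name `relative` is hypothetical; the timings are the "Text read" and "Parsing without charset" rows from the "Json files in the per-line mode" table.

```python
def relative(baseline_best_ms: float, case_best_ms: float) -> float:
    """Assumed definition of the Relative column: baseline time / case time."""
    return baseline_best_ms / case_best_ms


# Best Time(ms) values copied from the old (-) and new (+) rows of the diff.
old_text_read, old_parsing = 1096, 4107
new_text_read, new_parsing = 553, 4272

old_ratio = relative(old_text_read, old_parsing)
new_ratio = relative(new_text_read, new_parsing)

# How much each benchmark changed on its own between the two runs.
text_read_speedup = old_text_read / new_text_read
parsing_speedup = old_parsing / new_parsing

print(f"Relative: {old_ratio:.1f}X -> {new_ratio:.1f}X")
print(f"Text read speedup: {text_read_speedup:.2f}x")
print(f"Parsing speedup:   {parsing_speedup:.2f}x")
```

Under that assumption the ratio falls from 0.3X to 0.1X mostly because the baseline halved, while the parsing time is roughly flat (slightly slower) between the two runs. The two runs were on different machines, so this alone does not establish a regression.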
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org