You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "sadikovi (via GitHub)" <gi...@apache.org> on 2023/09/04 21:42:18 UTC

[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1315208847


##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
@@ -3,121 +3,125 @@ Benchmark for performance of JSON parsing
 ================================================================================================
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 JSON schema inferring:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        3720           3843         121          1.3         743.9       1.0X
-UTF-8 is set                                       5412           5455          45          0.9        1082.4       0.7X
+No encoding                                        2084           2134          46          2.4         416.8       1.0X
+UTF-8 is set                                       3077           3093          14          1.6         615.3       0.7X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 count a short column:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        3234           3254          33          1.5         646.7       1.0X
-UTF-8 is set                                       4847           4868          21          1.0         969.5       0.7X
+No encoding                                        2854           2863           8          1.8         570.8       1.0X
+UTF-8 is set                                       4066           4066           1          1.2         813.1       0.7X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 count a wide column:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        5702           5794         101          0.2        5702.1       1.0X
-UTF-8 is set                                       9526           9607          73          0.1        9526.1       0.6X
+No encoding                                        3348           3368          26          0.3        3347.8       1.0X
+UTF-8 is set                                       5215           5239          22          0.2        5214.7       0.6X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 select wide row:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                       18318          18448         199          0.0      366367.7       1.0X
-UTF-8 is set                                      19791          19887          99          0.0      395817.1       0.9X
+No encoding                                       11046          11102          54          0.0      220928.4       1.0X
+UTF-8 is set                                      12135          12181          54          0.0      242697.4       0.9X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 Select a subset of 10 columns:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 10 columns                                  2531           2570          51          0.4        2531.3       1.0X
-Select 1 column                                    1867           1882          16          0.5        1867.0       1.4X
+Select 10 columns                                  2486           2488           2          0.4        2486.5       1.0X
+Select 1 column                                    1505           1506           2          0.7        1504.6       1.7X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 creation of JSON parser per line:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Short column without encoding                       868            875           7          1.2         868.4       1.0X
-Short column with UTF-8                            1151           1163          11          0.9        1150.9       0.8X
-Wide column without encoding                      12063          12299         205          0.1       12063.0       0.1X
-Wide column with UTF-8                            16095          16136          51          0.1       16095.3       0.1X
+Short column without encoding                       888            889           3          1.1         887.6       1.0X
+Short column with UTF-8                            1134           1136           2          0.9        1134.3       0.8X
+Wide column without encoding                       8012           8056          51          0.1        8012.4       0.1X
+Wide column with UTF-8                             9830           9844          22          0.1        9829.7       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           165            170           4          6.1         164.7       1.0X
-from_json                                          2339           2386          77          0.4        2338.9       0.1X
-json_tuple                                         2667           2730          55          0.4        2667.3       0.1X
-get_json_object                                    2627           2659          32          0.4        2627.1       0.1X
+Text read                                            85             87           2         11.7          85.4       1.0X
+from_json                                          1706           1711           4          0.6        1706.4       0.1X
+json_tuple                                         1528           1534           7          0.7        1528.2       0.1X
+get_json_object                                    1275           1286          17          0.8        1275.0       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 Dataset of json strings:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           700            715          20          7.1         140.1       1.0X
-schema inferring                                   3144           3166          20          1.6         628.7       0.2X
-parsing                                            3261           3271           9          1.5         652.1       0.2X
+Text read                                           369            370           1         13.6          73.8       1.0X
+schema inferring                                   1880           1883           4          2.7         376.0       0.2X
+parsing                                            3731           3737           8          1.3         746.1       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                          1096           1105          12          4.6         219.1       1.0X
-Schema inferring                                   3818           3830          16          1.3         763.6       0.3X
-Parsing without charset                            4107           4137          32          1.2         821.4       0.3X
-Parsing with UTF-8                                 5717           5763          41          0.9        1143.3       0.2X
+Text read                                           553            579          32          9.0         110.6       1.0X
+Schema inferring                                   2195           2196           2          2.3         439.0       0.3X
+Parsing without charset                            4272           4274           3          1.2         854.3       0.1X

Review Comment:
   The overall ratio decreased because read text benchmark was 2x faster than before. Let me double check this. Now that you pointed it out, I am curious why the benchmark itself did not improve on a faster CPU when other cases did. If I confirm it is a regression, I will follow up on this in a separate PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org