You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "Hisoka-X (via GitHub)" <gi...@apache.org> on 2023/05/08 12:45:38 UTC

[GitHub] [spark] Hisoka-X opened a new pull request, #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source

Hisoka-X opened a new pull request, #41091:
URL: https://github.com/apache/spark/pull/41091

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   Follow up https://github.com/apache/spark/pull/36562 , performance improvement when Timestamp type inference with legacy format.
   
   In the current implementation of CSV/JSON data source, the schema inference with legacy format relies on methods that will throw exceptions if the fields can't convert as some data types .
   
   Throwing and catching exceptions can be slow. We can improve it by creating methods that return optional results instead.
   
   The optimization of DefaultTimestampFormatter has been implemented in https://github.com/apache/spark/pull/36562 , this PR adds the optimization of legacy format. The basic logic is to prevent the formatter from throwing exceptions, and then use catch to determine whether the parsing is successful.
   
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   
   Performance improvement when Timestamp type inference with legacy format.
   
   
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   Add new test
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Hisoka-X commented on pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41091:
URL: https://github.com/apache/spark/pull/41091#issuecomment-1549617800

   Thanks @MaxGekk 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Hisoka-X commented on pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41091:
URL: https://github.com/apache/spark/pull/41091#issuecomment-1542339003

   > Do the benchmarks `CSVBenchmark` and `JsonBenchmark` show any improvements? Could you regenerate the results `JsonBenchmark.*.txt` and `CSVBenchmark.*.txt`, please.
   
   I'm doing benchmarks, but I found a problem, like @gengliangwang said in https://github.com/apache/spark/pull/36562#issuecomment-1127600354 .The benchmarks no case for `the string inputs are not valid timestamps`. The speed up only work when `string input are not valid timestamps`. I'm worry about the benchmarks can't prove anything. Can I create a PR for add benchmarks for `type inference when string input are not valid timestamps`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41091:
URL: https://github.com/apache/spark/pull/41091#issuecomment-1542426882

   > Can I create a PR for add benchmarks for type inference when string input are not valid timestamps
   
   Yep. Let's do that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Hisoka-X commented on pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41091:
URL: https://github.com/apache/spark/pull/41091#issuecomment-1546830139

   > @Hisoka-X Please, resolve the conflicts and rebase on the recent master.
   
   Ok, I will add benchmarks for this too. Please wait. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41091:
URL: https://github.com/apache/spark/pull/41091#issuecomment-1546829442

   @Hisoka-X Please, resolve the conflicts and rebase on the recent master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk closed pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format  in JSON/CSV data source
URL: https://github.com/apache/spark/pull/41091


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41091:
URL: https://github.com/apache/spark/pull/41091#issuecomment-1549612885

   +1, LGTM. Merging to master.
   Thank you, @Hisoka-X.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Hisoka-X commented on a diff in pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41091:
URL: https://github.com/apache/spark/pull/41091#discussion_r1193139144


##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
@@ -4,120 +4,120 @@ Benchmark for performance of JSON parsing
 
 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 JSON schema inferring:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        3871           3914          69          1.3         774.2       1.0X
-UTF-8 is set                                       5539           5563          26          0.9        1107.8       0.7X
+No encoding                                        3720           3843         121          1.3         743.9       1.0X
+UTF-8 is set                                       5412           5455          45          0.9        1082.4       0.7X
 
 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 count a short column:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        2984           2999          24          1.7         596.9       1.0X
-UTF-8 is set                                       4875           4928          46          1.0         975.0       0.6X
+No encoding                                        3234           3254          33          1.5         646.7       1.0X
+UTF-8 is set                                       4847           4868          21          1.0         969.5       0.7X
 
 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 count a wide column:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        6353           6446         143          0.2        6353.4       1.0X
-UTF-8 is set                                      10548          10647         163          0.1       10547.8       0.6X
+No encoding                                        5702           5794         101          0.2        5702.1       1.0X
+UTF-8 is set                                       9526           9607          73          0.1        9526.1       0.6X
 
 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 select wide row:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                       18807          18880          66          0.0      376130.9       1.0X
-UTF-8 is set                                      20530          20554          23          0.0      410593.2       0.9X
+No encoding                                       18318          18448         199          0.0      366367.7       1.0X
+UTF-8 is set                                      19791          19887          99          0.0      395817.1       0.9X
 
 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 Select a subset of 10 columns:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 10 columns                                  2741           2749          12          0.4        2740.6       1.0X
-Select 1 column                                    1916           1925           8          0.5        1916.5       1.4X
+Select 10 columns                                  2531           2570          51          0.4        2531.3       1.0X
+Select 1 column                                    1867           1882          16          0.5        1867.0       1.4X
 
 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 creation of JSON parser per line:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Short column without encoding                       901            934          29          1.1         900.8       1.0X
-Short column with UTF-8                            1320           1343          31          0.8        1319.9       0.7X
-Wide column without encoding                      13446          13544         103          0.1       13445.8       0.1X
-Wide column with UTF-8                            17770          17854          76          0.1       17770.0       0.1X
+Short column without encoding                       868            875           7          1.2         868.4       1.0X
+Short column with UTF-8                            1151           1163          11          0.9        1150.9       0.8X
+Wide column without encoding                      12063          12299         205          0.1       12063.0       0.1X
+Wide column with UTF-8                            16095          16136          51          0.1       16095.3       0.1X
 
 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           159            167           9          6.3         159.2       1.0X
-from_json                                          2844           2863          25          0.4        2844.1       0.1X
-json_tuple                                         3137           3161          23          0.3        3136.7       0.1X
-get_json_object                                    2874           2884           9          0.3        2874.2       0.1X
+Text read                                           165            170           4          6.1         164.7       1.0X
+from_json                                          2339           2386          77          0.4        2338.9       0.1X
+json_tuple                                         2667           2730          55          0.4        2667.3       0.1X
+get_json_object                                    2627           2659          32          0.4        2627.1       0.1X
 
 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 Dataset of json strings:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           732            745          11          6.8         146.3       1.0X
-schema inferring                                   3260           3265           6          1.5         652.0       0.2X
-parsing                                            3592           3645          46          1.4         718.4       0.2X
+Text read                                           700            715          20          7.1         140.1       1.0X
+schema inferring                                   3144           3166          20          1.6         628.7       0.2X
+parsing                                            3261           3271           9          1.5         652.1       0.2X
 
 Preparing data for benchmarking ...
 OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                          1092           1100          11          4.6         218.4       1.0X
-Schema inferring                                   3814           3826          15          1.3         762.8       0.3X
-Parsing without charset                            4153           4184          32          1.2         830.7       0.3X
-Parsing with UTF-8                                 6014           6035          22          0.8        1202.9       0.2X
+Text read                                          1096           1105          12          4.6         219.1       1.0X
+Schema inferring                                   3818           3830          16          1.3         763.6       0.3X
+Parsing without charset                            4107           4137          32          1.2         821.4       0.3X
+Parsing with UTF-8                                 5717           5763          41          0.9        1143.3       0.2X
 
 OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Create a dataset of timestamps                      193            198           4          5.2         193.5       1.0X
-to_json(timestamp)                                 1566           1582          14          0.6        1566.4       0.1X
-write timestamps to files                          1265           1274          14          0.8        1265.1       0.2X
-Create a dataset of dates                           232            239          10          4.3         231.9       0.8X
-to_json(date)                                      1037           1058          18          1.0        1037.2       0.2X
-write dates to files                                766            770           7          1.3         765.6       0.3X
+Create a dataset of timestamps                      199            202           3          5.0         198.9       1.0X
+to_json(timestamp)                                 1458           1487          26          0.7        1458.0       0.1X
+write timestamps to files                          1232           1262          26          0.8        1232.5       0.2X
+Create a dataset of dates                           231            237           5          4.3         230.8       0.9X
+to_json(date)                                       956            966          10          1.0         956.5       0.2X
+write dates to files                                785            793          10          1.3         785.4       0.3X
 
 OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 Read dates and timestamps:                                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------------------------------------------------
-read timestamp text from files                                                   283            289           6          3.5         283.1       1.0X
-read timestamps from files                                                      3364           3431          60          0.3        3363.6       0.1X
-infer timestamps from files                                                     8913           8935          38          0.1        8912.6       0.0X
-read date text from files                                                        263            267           4          3.8         262.9       1.1X
-read date from files                                                            1102           1116          12          0.9        1101.7       0.3X
-timestamp strings                                                                412            426          14          2.4         412.0       0.7X
-parse timestamps from Dataset[String]                                           3941           3956          14          0.3        3940.8       0.1X
-infer timestamps from Dataset[String]                                           9334           9383          43          0.1        9333.8       0.0X
-date strings                                                                     469            484          24          2.1         469.3       0.6X
-parse dates from Dataset[String]                                                1565           1572          11          0.6        1564.8       0.2X
-from_json(timestamp)                                                            5825           5917          88          0.2        5824.5       0.0X
-from_json(date)                                                                 3553           3574          19          0.3        3553.1       0.1X
-infer error timestamps from Dataset[String] with default format                 2590           2609          19          0.4        2589.9       0.1X
-infer error timestamps from Dataset[String] with user-provided format           2517           2551          30          0.4        2516.8       0.1X
-infer error timestamps from Dataset[String] with legacy format                  6836           6876          63          0.1        6836.1       0.0X

Review Comment:
   @MaxGekk Hi, I updated the benchmark, the speed already up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org