Posted to reviews@spark.apache.org by "Hisoka-X (via GitHub)" <gi...@apache.org> on 2023/05/08 12:45:38 UTC
[GitHub] [spark] Hisoka-X opened a new pull request, #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source
Hisoka-X opened a new pull request, #41091:
URL: https://github.com/apache/spark/pull/41091
### What changes were proposed in this pull request?
This is a follow-up to https://github.com/apache/spark/pull/36562 , improving the performance of Timestamp type inference with the legacy format.
In the current implementation of the CSV/JSON data sources, schema inference with the legacy format relies on methods that throw exceptions when fields cannot be converted to certain data types.
Throwing and catching exceptions can be slow. We can improve this by adding methods that return optional results instead.
The optimization of DefaultTimestampFormatter was implemented in https://github.com/apache/spark/pull/36562 ; this PR adds the same optimization for the legacy format. The basic idea is to keep the formatter from throwing exceptions, rather than relying on catch to determine whether parsing succeeded.
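To illustrate the idea, here is a minimal standalone sketch (not the actual Spark code; `LegacyParseSketch` and its method names are hypothetical) contrasting exception-driven parsing with an optional-result parse built on `SimpleDateFormat.parse(String, ParsePosition)`, which reports failure through the position instead of throwing:

```scala
import java.text.{ParsePosition, SimpleDateFormat}
import java.util.Date

object LegacyParseSketch {
  private val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

  // Exception-based check: slow on invalid input, because every miss
  // constructs and unwinds a ParseException.
  def isTimestampViaCatch(s: String): Boolean =
    try { format.parse(s); true }
    catch { case _: java.text.ParseException => false }

  // Optional-result check: parse(String, ParsePosition) returns null on
  // failure and leaves the position behind, so no exception is thrown.
  def parseOption(s: String): Option[Date] = {
    val pos = new ParsePosition(0)
    val parsed = format.parse(s, pos)
    // Require the whole string to be consumed, not just a valid prefix.
    if (parsed != null && pos.getIndex == s.length) Some(parsed) else None
  }
}
```

During schema inference, invalid timestamps are the common case (most string columns are not timestamps), so avoiding an exception per non-matching row is where the speedup comes from.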
### Why are the changes needed?
Performance improvement for Timestamp type inference with the legacy format.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] Hisoka-X commented on pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41091:
URL: https://github.com/apache/spark/pull/41091#issuecomment-1549617800
Thanks @MaxGekk
[GitHub] [spark] Hisoka-X commented on pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41091:
URL: https://github.com/apache/spark/pull/41091#issuecomment-1542339003
> Do the benchmarks `CSVBenchmark` and `JsonBenchmark` show any improvements? Could you regenerate the results `JsonBenchmark.*.txt` and `CSVBenchmark.*.txt`, please.
I'm running the benchmarks, but I found a problem, as @gengliangwang said in https://github.com/apache/spark/pull/36562#issuecomment-1127600354 : the benchmarks have no case where the string inputs are not valid timestamps, yet the speedup only applies when the inputs are not valid timestamps. I'm worried the current benchmarks can't prove anything. Can I create a PR to add benchmarks for type inference when string inputs are not valid timestamps?
[GitHub] [spark] MaxGekk commented on pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41091:
URL: https://github.com/apache/spark/pull/41091#issuecomment-1542426882
> Can I create a PR for add benchmarks for type inference when string input are not valid timestamps
Yep. Let's do that.
[GitHub] [spark] Hisoka-X commented on pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41091:
URL: https://github.com/apache/spark/pull/41091#issuecomment-1546830139
> @Hisoka-X Please, resolve the conflicts and rebase on the recent master.
Ok, I will add benchmarks for this too. Please wait. Thanks!
[GitHub] [spark] MaxGekk commented on pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41091:
URL: https://github.com/apache/spark/pull/41091#issuecomment-1546829442
@Hisoka-X Please, resolve the conflicts and rebase on the recent master.
[GitHub] [spark] MaxGekk closed pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source
URL: https://github.com/apache/spark/pull/41091
[GitHub] [spark] MaxGekk commented on pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41091:
URL: https://github.com/apache/spark/pull/41091#issuecomment-1549612885
+1, LGTM. Merging to master.
Thank you, @Hisoka-X.
[GitHub] [spark] Hisoka-X commented on a diff in pull request #41091: [SPARK-39281][SQL] Speed up Timestamp type inference with legacy format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41091:
URL: https://github.com/apache/spark/pull/41091#discussion_r1193139144
##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
@@ -4,120 +4,120 @@ Benchmark for performance of JSON parsing
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
JSON schema inferring: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 3871 3914 69 1.3 774.2 1.0X
-UTF-8 is set 5539 5563 26 0.9 1107.8 0.7X
+No encoding 3720 3843 121 1.3 743.9 1.0X
+UTF-8 is set 5412 5455 45 0.9 1082.4 0.7X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
count a short column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 2984 2999 24 1.7 596.9 1.0X
-UTF-8 is set 4875 4928 46 1.0 975.0 0.6X
+No encoding 3234 3254 33 1.5 646.7 1.0X
+UTF-8 is set 4847 4868 21 1.0 969.5 0.7X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
count a wide column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 6353 6446 143 0.2 6353.4 1.0X
-UTF-8 is set 10548 10647 163 0.1 10547.8 0.6X
+No encoding 5702 5794 101 0.2 5702.1 1.0X
+UTF-8 is set 9526 9607 73 0.1 9526.1 0.6X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
select wide row: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 18807 18880 66 0.0 376130.9 1.0X
-UTF-8 is set 20530 20554 23 0.0 410593.2 0.9X
+No encoding 18318 18448 199 0.0 366367.7 1.0X
+UTF-8 is set 19791 19887 99 0.0 395817.1 0.9X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Select a subset of 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Select 10 columns 2741 2749 12 0.4 2740.6 1.0X
-Select 1 column 1916 1925 8 0.5 1916.5 1.4X
+Select 10 columns 2531 2570 51 0.4 2531.3 1.0X
+Select 1 column 1867 1882 16 0.5 1867.0 1.4X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
creation of JSON parser per line: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Short column without encoding 901 934 29 1.1 900.8 1.0X
-Short column with UTF-8 1320 1343 31 0.8 1319.9 0.7X
-Wide column without encoding 13446 13544 103 0.1 13445.8 0.1X
-Wide column with UTF-8 17770 17854 76 0.1 17770.0 0.1X
+Short column without encoding 868 875 7 1.2 868.4 1.0X
+Short column with UTF-8 1151 1163 11 0.9 1150.9 0.8X
+Wide column without encoding 12063 12299 205 0.1 12063.0 0.1X
+Wide column with UTF-8 16095 16136 51 0.1 16095.3 0.1X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 159 167 9 6.3 159.2 1.0X
-from_json 2844 2863 25 0.4 2844.1 0.1X
-json_tuple 3137 3161 23 0.3 3136.7 0.1X
-get_json_object 2874 2884 9 0.3 2874.2 0.1X
+Text read 165 170 4 6.1 164.7 1.0X
+from_json 2339 2386 77 0.4 2338.9 0.1X
+json_tuple 2667 2730 55 0.4 2667.3 0.1X
+get_json_object 2627 2659 32 0.4 2627.1 0.1X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Dataset of json strings: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 732 745 11 6.8 146.3 1.0X
-schema inferring 3260 3265 6 1.5 652.0 0.2X
-parsing 3592 3645 46 1.4 718.4 0.2X
+Text read 700 715 20 7.1 140.1 1.0X
+schema inferring 3144 3166 20 1.6 628.7 0.2X
+parsing 3261 3271 9 1.5 652.1 0.2X
Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Json files in the per-line mode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 1092 1100 11 4.6 218.4 1.0X
-Schema inferring 3814 3826 15 1.3 762.8 0.3X
-Parsing without charset 4153 4184 32 1.2 830.7 0.3X
-Parsing with UTF-8 6014 6035 22 0.8 1202.9 0.2X
+Text read 1096 1105 12 4.6 219.1 1.0X
+Schema inferring 3818 3830 16 1.3 763.6 0.3X
+Parsing without charset 4107 4137 32 1.2 821.4 0.3X
+Parsing with UTF-8 5717 5763 41 0.9 1143.3 0.2X
OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Create a dataset of timestamps 193 198 4 5.2 193.5 1.0X
-to_json(timestamp) 1566 1582 14 0.6 1566.4 0.1X
-write timestamps to files 1265 1274 14 0.8 1265.1 0.2X
-Create a dataset of dates 232 239 10 4.3 231.9 0.8X
-to_json(date) 1037 1058 18 1.0 1037.2 0.2X
-write dates to files 766 770 7 1.3 765.6 0.3X
+Create a dataset of timestamps 199 202 3 5.0 198.9 1.0X
+to_json(timestamp) 1458 1487 26 0.7 1458.0 0.1X
+write timestamps to files 1232 1262 26 0.8 1232.5 0.2X
+Create a dataset of dates 231 237 5 4.3 230.8 0.9X
+to_json(date) 956 966 10 1.0 956.5 0.2X
+write dates to files 785 793 10 1.3 785.4 0.3X
OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
+Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------------------------------
-read timestamp text from files 283 289 6 3.5 283.1 1.0X
-read timestamps from files 3364 3431 60 0.3 3363.6 0.1X
-infer timestamps from files 8913 8935 38 0.1 8912.6 0.0X
-read date text from files 263 267 4 3.8 262.9 1.1X
-read date from files 1102 1116 12 0.9 1101.7 0.3X
-timestamp strings 412 426 14 2.4 412.0 0.7X
-parse timestamps from Dataset[String] 3941 3956 14 0.3 3940.8 0.1X
-infer timestamps from Dataset[String] 9334 9383 43 0.1 9333.8 0.0X
-date strings 469 484 24 2.1 469.3 0.6X
-parse dates from Dataset[String] 1565 1572 11 0.6 1564.8 0.2X
-from_json(timestamp) 5825 5917 88 0.2 5824.5 0.0X
-from_json(date) 3553 3574 19 0.3 3553.1 0.1X
-infer error timestamps from Dataset[String] with default format 2590 2609 19 0.4 2589.9 0.1X
-infer error timestamps from Dataset[String] with user-provided format 2517 2551 30 0.4 2516.8 0.1X
-infer error timestamps from Dataset[String] with legacy format 6836 6876 63 0.1 6836.1 0.0X
Review Comment:
@MaxGekk Hi, I updated the benchmark; the speedup now shows in the results.