Posted to reviews@spark.apache.org by "Hisoka-X (via GitHub)" <gi...@apache.org> on 2023/05/07 06:07:20 UTC
[GitHub] [spark] Hisoka-X opened a new pull request, #41078: [SPARK-39280][SQL] Fasten Timestamp type inference with user-provided format in JSON/CSV data source
Hisoka-X opened a new pull request, #41078:
URL: https://github.com/apache/spark/pull/41078
### What changes were proposed in this pull request?
Follow-up to #36562: improves performance of Timestamp type inference when a user-provided format is used.
### Why are the changes needed?
Improves performance of Timestamp type inference when a user-provided format is used.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192110202
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
pattern, zoneId, locale, legacyFormat)
+ override def parseOptional(s: String): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
Review Comment:
It will be checked at https://github.com/apache/spark/pull/41078/files/77c78b1f0ed3f7ee3c301eec2fa543a048c35531#diff-b42bcba727feeebf78f0e5540f2d4f6c6a38afd2225e4ebeae22a604e42eb094R183-R188, so the logic can't go wrong. But in that situation, the speed-up will not apply.
[GitHub] [spark] HyukjinKwon commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1537595353
cc @sadikovi too
[GitHub] [spark] MaxGekk closed pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
URL: https://github.com/apache/spark/pull/41078
[GitHub] [spark] MaxGekk commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192102884
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
pattern, zoneId, locale, legacyFormat)
+ override def parseOptional(s: String): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
Review Comment:
`parseUnresolved()` doesn't validate the correctness of its result, see its docs:
```
The result of this method is TemporalAccessor which represents the data as seen in the input. Values are not validated, thus parsing a date string of '2012-00-65' would result in a temporal with three fields - year of '2012', month of '0' and day-of-month of '65'.
```
I think you should check the result yourself somewhere.
Could you re-check the example from the docs: '2012-00-65' and add a test, please.
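The distinction above can be checked directly against `java.time` in plain Java (class and method names here are illustrative, not from the PR): `parseUnresolved()` skips the resolving phase, so the out-of-range string `'2012-00-65'` from the docs comes back with raw field values, while the full `parse()` validates during resolution and throws.

```java
import java.text.ParsePosition;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;
import java.time.temporal.TemporalAccessor;

public class ParseUnresolvedDemo {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("uuuu-MM-dd");

    // parseUnresolved() skips the resolving phase, so out-of-range
    // fields like month 0 or day-of-month 65 pass through unvalidated.
    static long field(String s, ChronoField f) {
        TemporalAccessor parsed = FMT.parseUnresolved(s, new ParsePosition(0));
        return parsed.getLong(f);
    }

    // By contrast, the full parse() resolves and validates the result,
    // throwing DateTimeParseException for an invalid date.
    static boolean fullParseFails(String s) {
        try {
            FMT.parse(s);
            return false;
        } catch (DateTimeParseException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(field("2012-00-65", ChronoField.MONTH_OF_YEAR)); // 0
        System.out.println(field("2012-00-65", ChronoField.DAY_OF_MONTH));  // 65
        System.out.println(fullParseFails("2012-00-65"));                   // true
    }
}
```

This is why the reviewer asks for an explicit validity check: the unresolved result must be re-validated before it is trusted.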
[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1546657983
@MaxGekk I adjusted the code per your suggestion. By the way, maybe we should use spotless to avoid problems like this.
[GitHub] [spark] MaxGekk commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1541990704
The same question as for https://github.com/apache/spark/pull/41091:
_Do the benchmarks `CSVBenchmark` and `JsonBenchmark` show any improvements? Could you regenerate the results `JsonBenchmark.*.txt` and `CSVBenchmark.*.txt`, please._
[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1546829555
Thanks @MaxGekk for your patience and @srowen @HyukjinKwon.
[GitHub] [spark] srowen commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1537441000
Can you fill out the JIRA, and add a little bit of explanation here?
It seems like you just avoid trying to extract more fields if basic parsing fails?
[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192178990
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
pattern, zoneId, locale, legacyFormat)
+ override def parseOptional(s: String): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
Review Comment:
> Could you re-check the example from the docs: '2012-00-65' and add a test, please.
Done
[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1545815468
> @Hisoka-X Could you highlight in the PR description how much faster it becomes? Please put some numbers in the `Why are the changes needed?` section.
Done
[GitHub] [spark] MaxGekk commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192342220
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
pattern, zoneId, locale, legacyFormat)
+ override def parseOptional(s: String): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+ Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
Review Comment:
Could you move the common part:
```scala
Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
```
to `extractSeconds()`
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
pattern, zoneId, locale, legacyFormat)
+ override def parseOptional(s: String): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+ Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
+ } else {
+ None
+ }
+ } catch {
+ case NonFatal(_) => None
+ }
+ }
+
+ private def extractSeconds(parsed: TemporalAccessor): (Long, Long) = {
+ val parsedZoneId = parsed.query(TemporalQueries.zone())
+ val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+ val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+ val epochSeconds = zonedDateTime.toEpochSecond
+ val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+ (epochSeconds, microsOfSecond)
+ }
Review Comment:
```suggestion
private def extractMicros(parsed: TemporalAccessor): Long = {
val parsedZoneId = parsed.query(TemporalQueries.zone())
val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
val epochSeconds = zonedDateTime.toEpochSecond
val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
}
```
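The suggested `extractMicros()` refactor can be sketched as standalone Java over the same `java.time` calls (the class name and the fixed `ZonedDateTime` input are illustrative only):

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoField;

public class ExtractMicrosDemo {
    private static final long MICROS_PER_SECOND = 1_000_000L;

    // Combine epoch seconds with the micro-of-second field into a single
    // microsecond count. The *Exact variants make an out-of-range
    // timestamp overflow loudly instead of silently wrapping.
    static long extractMicros(ZonedDateTime zdt) {
        long epochSeconds = zdt.toEpochSecond();
        long microsOfSecond = zdt.get(ChronoField.MICRO_OF_SECOND);
        return Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond);
    }

    public static void main(String[] args) {
        // 1 second plus 500,000,000 nanoseconds after the epoch.
        ZonedDateTime zdt = ZonedDateTime.of(1970, 1, 1, 0, 0, 1, 500_000_000, ZoneId.of("UTC"));
        System.out.println(extractMicros(zdt)); // 1500000
    }
}
```

Folding the `addExact`/`multiplyExact` expression into the helper, as suggested, keeps `parse()` and `parseOptional()` from duplicating the conversion.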
[GitHub] [spark] MaxGekk commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1546828727
+1, LGTM. Merging to master.
Thank you, @Hisoka-X and @srowen @HyukjinKwon for review.
[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192986251
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,63 @@ class Iso8601TimestampFormatter(
protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
pattern, zoneId, locale, legacyFormat)
+ override def parseOptional(s: String): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ Some(extractMicros(parsed))
+ } else {
+ None
+ }
+ } catch {
+ case NonFatal(_) => None
+ }
+ }
+
+ private def extractMicros(parsed: TemporalAccessor): Long = {
+ val parsedZoneId = parsed.query(TemporalQueries.zone())
+ val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+ val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+ val epochSeconds = zonedDateTime.toEpochSecond
+ val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+ Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+ }
+
override def parse(s: String): Long = {
try {
val parsed = formatter.parse(s)
- val parsedZoneId = parsed.query(TemporalQueries.zone())
- val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
- val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
- val epochSeconds = zonedDateTime.toEpochSecond
- val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
-
- Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+ extractMicros(parsed)
} catch checkParsedDiff(s, legacyFormatter.parse)
}
+ override def parseWithoutTimeZoneOptional(s: String, allowTimeZone: Boolean): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
+ Some(DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime)))
+ } else {
+ None
+ }
+ } catch {
+ case NonFatal(_) => None
+ }
+ }
+
+ private def extractDateAndTime(s: String, parsed: TemporalAccessor, allowTimeZone: Boolean):
+ (LocalDate, LocalTime) = {
+ if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
+ throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
+ }
+ val localDate = toLocalDate(parsed)
+ val localTime = toLocalTime(parsed)
+ (localDate, localTime)
+ }
Review Comment:
Thanks for the reminder, I missed this.
[GitHub] [spark] MaxGekk commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1545800041
@Hisoka-X Could you highlight in the PR description how much faster it becomes? Please put some numbers in the `Why are the changes needed?` section.
[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1186872738
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
pattern, zoneId, locale, legacyFormat)
+ override def parseOptional(s: String): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+ Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
+ } else {
+ None
+ }
+ } catch {
+ case NonFatal(_) => None
Review Comment:
Yes, we want to avoid exceptions like `DateTimeParseException`, which are thrown when a field value can't be parsed with the format. Only JVM errors like `VirtualMachineError` are left uncaught here, the same approach as in `DateTimeUtils.stringToTimestamp`.
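The exception-free fast path being discussed can be sketched in plain Java (a simplified illustration; the class and method names are hypothetical, not the PR's code): `parseUnresolved()` reports failure by returning `null` rather than throwing, so malformed rows cost no exception, and the `catch` is only a fall-back for unexpected runtime errors.

```java
import java.text.ParsePosition;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;
import java.util.Optional;

public class ParseOptionalDemo {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    // parseUnresolved() returns null on a parse failure instead of
    // throwing DateTimeParseException, so the common failure path is
    // exception-free. Fatal JVM errors (e.g. VirtualMachineError) are
    // not Exceptions and still propagate past this catch.
    static Optional<TemporalAccessor> parseOptional(String s) {
        try {
            return Optional.ofNullable(FMT.parseUnresolved(s, new ParsePosition(0)));
        } catch (RuntimeException e) {
            return Optional.empty(); // fall-back for unexpected errors only
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOptional("2023-05-07 06:07:20").isPresent()); // true
        System.out.println(parseOptional("not a timestamp").isPresent());     // false
    }
}
```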
[GitHub] [spark] srowen commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1186870024
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
pattern, zoneId, locale, legacyFormat)
+ override def parseOptional(s: String): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+ Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
+ } else {
+ None
+ }
+ } catch {
+ case NonFatal(_) => None
Review Comment:
You're saying you're avoiding exceptions, but then this catches exceptions. Is this just the fall-back case for unexpected errors?
[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Fasten Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1537302656
cc @gengliangwang @HyukjinKwon @srowen
[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192410333
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
pattern, zoneId, locale, legacyFormat)
+ override def parseOptional(s: String): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+ Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
+ } else {
+ None
+ }
+ } catch {
+ case NonFatal(_) => None
+ }
+ }
+
+ private def extractSeconds(parsed: TemporalAccessor): (Long, Long) = {
+ val parsedZoneId = parsed.query(TemporalQueries.zone())
+ val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+ val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+ val epochSeconds = zonedDateTime.toEpochSecond
+ val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+ (epochSeconds, microsOfSecond)
+ }
Review Comment:
Done
[GitHub] [spark] MaxGekk commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192941467
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,63 @@ class Iso8601TimestampFormatter(
protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
pattern, zoneId, locale, legacyFormat)
+ override def parseOptional(s: String): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ Some(extractMicros(parsed))
+ } else {
+ None
+ }
+ } catch {
+ case NonFatal(_) => None
+ }
+ }
+
+ private def extractMicros(parsed: TemporalAccessor): Long = {
+ val parsedZoneId = parsed.query(TemporalQueries.zone())
+ val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+ val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+ val epochSeconds = zonedDateTime.toEpochSecond
+ val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+ Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+ }
+
override def parse(s: String): Long = {
try {
val parsed = formatter.parse(s)
- val parsedZoneId = parsed.query(TemporalQueries.zone())
- val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
- val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
- val epochSeconds = zonedDateTime.toEpochSecond
- val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
-
- Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+ extractMicros(parsed)
} catch checkParsedDiff(s, legacyFormatter.parse)
}
+ override def parseWithoutTimeZoneOptional(s: String, allowTimeZone: Boolean): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
+ Some(DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime)))
+ } else {
+ None
+ }
+ } catch {
+ case NonFatal(_) => None
+ }
+ }
+
+ private def extractDateAndTime(s: String, parsed: TemporalAccessor, allowTimeZone: Boolean):
+ (LocalDate, LocalTime) = {
+ if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
+ throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
+ }
+ val localDate = toLocalDate(parsed)
+ val localTime = toLocalTime(parsed)
+ (localDate, localTime)
+ }
Review Comment:
Let's deduplicate the code, and put one more common line here:
```suggestion
private def extractMicrosNTZ(
s: String,
parsed: TemporalAccessor,
allowTimeZone: Boolean): Long = {
if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
}
val localDate = toLocalDate(parsed)
val localTime = toLocalTime(parsed)
DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime))
}
```
##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##########
@@ -472,4 +472,24 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite {
assert(
formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
}
+
+ test("SPARK-39280: support returning optional parse results in the iso8601 formatter") {
+ val formatter = new Iso8601TimestampFormatter(
+ "yyyy-MM-dd HH:mm:ss.SSSS",
+ locale = DateFormatter.defaultLocale,
+ legacyFormat = LegacyDateFormats.SIMPLE_DATE_FORMAT,
+ isParsing = true, zoneId = DateTimeTestUtils.LA)
+ assert(formatter.parseOptional("9999-12-31 23:59:59.9990").contains(253402329599999000L))
+ assert(
+ formatter.parseWithoutTimeZoneOptional("9999-12-31 23:59:59.9990", false)
+ .contains(253402300799999000L))
Review Comment:
```suggestion
assert(formatter.parseWithoutTimeZoneOptional("9999-12-31 23:59:59.9990", false)
.contains(253402300799999000L))
```
##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##########
@@ -472,4 +472,24 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite {
assert(
formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
}
+
+ test("SPARK-39280: support returning optional parse results in the iso8601 formatter") {
+ val formatter = new Iso8601TimestampFormatter(
+ "yyyy-MM-dd HH:mm:ss.SSSS",
+ locale = DateFormatter.defaultLocale,
+ legacyFormat = LegacyDateFormats.SIMPLE_DATE_FORMAT,
+ isParsing = true, zoneId = DateTimeTestUtils.LA)
+ assert(formatter.parseOptional("9999-12-31 23:59:59.9990").contains(253402329599999000L))
+ assert(
+ formatter.parseWithoutTimeZoneOptional("9999-12-31 23:59:59.9990", false)
+ .contains(253402300799999000L))
+ assert(formatter.parseOptional("abc").isEmpty)
+ assert(
+ formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
Review Comment:
```suggestion
assert(formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
```
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,63 @@ class Iso8601TimestampFormatter(
protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
pattern, zoneId, locale, legacyFormat)
+ override def parseOptional(s: String): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ Some(extractMicros(parsed))
+ } else {
+ None
+ }
+ } catch {
+ case NonFatal(_) => None
+ }
+ }
+
+ private def extractMicros(parsed: TemporalAccessor): Long = {
+ val parsedZoneId = parsed.query(TemporalQueries.zone())
+ val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+ val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+ val epochSeconds = zonedDateTime.toEpochSecond
+ val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+ Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+ }
+
override def parse(s: String): Long = {
try {
val parsed = formatter.parse(s)
- val parsedZoneId = parsed.query(TemporalQueries.zone())
- val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
- val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
- val epochSeconds = zonedDateTime.toEpochSecond
- val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
-
- Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+ extractMicros(parsed)
} catch checkParsedDiff(s, legacyFormatter.parse)
}
+ override def parseWithoutTimeZoneOptional(s: String, allowTimeZone: Boolean): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
+ Some(DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime)))
Review Comment:
```suggestion
Some(extractMicrosNTZ(s, parsed, allowTimeZone))
```
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,63 @@ class Iso8601TimestampFormatter(
protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
pattern, zoneId, locale, legacyFormat)
+ override def parseOptional(s: String): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ Some(extractMicros(parsed))
+ } else {
+ None
+ }
+ } catch {
+ case NonFatal(_) => None
+ }
+ }
+
+ private def extractMicros(parsed: TemporalAccessor): Long = {
+ val parsedZoneId = parsed.query(TemporalQueries.zone())
+ val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+ val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+ val epochSeconds = zonedDateTime.toEpochSecond
+ val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+ Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+ }
+
override def parse(s: String): Long = {
try {
val parsed = formatter.parse(s)
- val parsedZoneId = parsed.query(TemporalQueries.zone())
- val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
- val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
- val epochSeconds = zonedDateTime.toEpochSecond
- val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
-
- Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+ extractMicros(parsed)
} catch checkParsedDiff(s, legacyFormatter.parse)
}
+ override def parseWithoutTimeZoneOptional(s: String, allowTimeZone: Boolean): Option[Long] = {
+ try {
+ val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+ if (parsed != null) {
+ val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
+ Some(DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime)))
+ } else {
+ None
+ }
+ } catch {
+ case NonFatal(_) => None
+ }
+ }
+
+ private def extractDateAndTime(s: String, parsed: TemporalAccessor, allowTimeZone: Boolean):
+ (LocalDate, LocalTime) = {
+ if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
+ throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
+ }
+ val localDate = toLocalDate(parsed)
+ val localTime = toLocalTime(parsed)
+ (localDate, localTime)
+ }
+
override def parseWithoutTimeZone(s: String, allowTimeZone: Boolean): Long = {
try {
val parsed = formatter.parse(s)
- if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
- throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
- }
- val localDate = toLocalDate(parsed)
- val localTime = toLocalTime(parsed)
+ val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime))
Review Comment:
Let's move the common line to `extractMicrosNTZ()`:
```scala
extractMicrosNTZ(s, parsed, allowTimeZone)
```
##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##########
@@ -472,4 +472,24 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite {
assert(
formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
}
+
+ test("SPARK-39280: support returning optional parse results in the iso8601 formatter") {
+ val formatter = new Iso8601TimestampFormatter(
+ "yyyy-MM-dd HH:mm:ss.SSSS",
+ locale = DateFormatter.defaultLocale,
+ legacyFormat = LegacyDateFormats.SIMPLE_DATE_FORMAT,
+ isParsing = true, zoneId = DateTimeTestUtils.LA)
+ assert(formatter.parseOptional("9999-12-31 23:59:59.9990").contains(253402329599999000L))
+ assert(
+ formatter.parseWithoutTimeZoneOptional("9999-12-31 23:59:59.9990", false)
+ .contains(253402300799999000L))
+ assert(formatter.parseOptional("abc").isEmpty)
+ assert(
+ formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
+
+ assert(formatter.parseOptional("2012-00-65 23:59:59.9990").isEmpty)
+ assert(
+ formatter.parseWithoutTimeZoneOptional("2012-00-65 23:59:59.9990", false)
+ .isEmpty)
Review Comment:
```suggestion
assert(formatter.parseWithoutTimeZoneOptional("2012-00-65 23:59:59.9990", false).isEmpty)
```
[GitHub] [spark] Hisoka-X closed pull request #41078: [SPARK-39280][SQL] Fasten Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X closed pull request #41078: [SPARK-39280][SQL] Fasten Timestamp type inference with user-provided format in JSON/CSV data source
URL: https://github.com/apache/spark/pull/41078
[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1537447586
> Can you fill out the JIRA, and add a little bit of explanation here?
Done
> It seems like you just avoid trying to extract more fields if basic parsing fails?
A different parsing method of the formatter (`parseUnresolved`) is used so that the formatter does not throw exceptions on invalid input; this is similar to the approach in https://github.com/apache/spark/pull/36562 .
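The point above can be illustrated with a minimal, self-contained sketch (not Spark code): `java.time.format.DateTimeFormatter.parse` throws a `DateTimeParseException` on malformed input, while `parseUnresolved` returns `null` and records the error position instead, which avoids the cost of constructing and unwinding exceptions during schema inference over many bad rows.

```java
import java.text.ParsePosition;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;

public class ParseUnresolvedDemo {
    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

        // parseUnresolved returns null on a mismatch instead of throwing,
        // and reports where parsing failed via the ParsePosition.
        ParsePosition pos = new ParsePosition(0);
        TemporalAccessor bad = fmt.parseUnresolved("abc", pos);
        System.out.println(bad == null);            // mismatch: null result
        System.out.println(pos.getErrorIndex() >= 0); // error index is set

        // A matching input yields a (still unresolved) TemporalAccessor.
        TemporalAccessor ok = fmt.parseUnresolved("2023-05-07 06:07:20", new ParsePosition(0));
        System.out.println(ok != null);
    }
}
```

Note that `parseUnresolved` skips the resolution phase, so field validation (e.g. rejecting month 0) only happens when the caller later resolves the fields, which is why the PR still wraps the extraction in a `try`/`catch(NonFatal)`.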
[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1191261080
##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
@@ -3,121 +3,121 @@ Benchmark for performance of JSON parsing
================================================================================================
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
JSON schema inferring: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 2973 3233 291 1.7 594.7 1.0X
-UTF-8 is set 4375 4796 430 1.1 874.9 0.7X
+No encoding 3871 3914 69 1.3 774.2 1.0X
+UTF-8 is set 5539 5563 26 0.9 1107.8 0.7X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
count a short column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 2359 2404 39 2.1 471.8 1.0X
-UTF-8 is set 3814 3885 101 1.3 762.8 0.6X
+No encoding 2984 2999 24 1.7 596.9 1.0X
+UTF-8 is set 4875 4928 46 1.0 975.0 0.6X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
count a wide column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 4630 4969 347 0.2 4630.4 1.0X
-UTF-8 is set 8963 9040 82 0.1 8963.4 0.5X
+No encoding 6353 6446 143 0.2 6353.4 1.0X
+UTF-8 is set 10548 10647 163 0.1 10547.8 0.6X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
select wide row: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-No encoding 15252 15481 329 0.0 305030.9 1.0X
-UTF-8 is set 16349 16961 627 0.0 326988.8 0.9X
+No encoding 18807 18880 66 0.0 376130.9 1.0X
+UTF-8 is set 20530 20554 23 0.0 410593.2 0.9X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Select a subset of 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Select 10 columns 2290 2296 6 0.4 2289.6 1.0X
-Select 1 column 1636 1652 15 0.6 1635.6 1.4X
+Select 10 columns 2741 2749 12 0.4 2740.6 1.0X
+Select 1 column 1916 1925 8 0.5 1916.5 1.4X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
creation of JSON parser per line: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Short column without encoding 661 673 12 1.5 661.1 1.0X
-Short column with UTF-8 950 978 26 1.1 950.1 0.7X
-Wide column without encoding 11106 11297 179 0.1 11106.4 0.1X
-Wide column with UTF-8 13743 13762 18 0.1 13743.3 0.0X
+Short column without encoding 901 934 29 1.1 900.8 1.0X
+Short column with UTF-8 1320 1343 31 0.8 1319.9 0.7X
+Wide column without encoding 13446 13544 103 0.1 13445.8 0.1X
+Wide column with UTF-8 17770 17854 76 0.1 17770.0 0.1X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 119 131 15 8.4 119.5 1.0X
-from_json 2475 2493 18 0.4 2474.9 0.0X
-json_tuple 2680 2745 57 0.4 2680.3 0.0X
-get_json_object 2549 2630 88 0.4 2549.3 0.0X
+Text read 159 167 9 6.3 159.2 1.0X
+from_json 2844 2863 25 0.4 2844.1 0.1X
+json_tuple 3137 3161 23 0.3 3136.7 0.1X
+get_json_object 2874 2884 9 0.3 2874.2 0.1X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Dataset of json strings: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 545 567 29 9.2 109.0 1.0X
-schema inferring 2460 2498 42 2.0 492.1 0.2X
-parsing 2618 2656 36 1.9 523.6 0.2X
+Text read 732 745 11 6.8 146.3 1.0X
+schema inferring 3260 3265 6 1.5 652.0 0.2X
+parsing 3592 3645 46 1.4 718.4 0.2X
Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Json files in the per-line mode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Text read 884 897 16 5.7 176.8 1.0X
-Schema inferring 3016 3029 21 1.7 603.2 0.3X
-Parsing without charset 3251 3267 14 1.5 650.2 0.3X
-Parsing with UTF-8 4892 5020 118 1.0 978.3 0.2X
+Text read 1092 1100 11 4.6 218.4 1.0X
+Schema inferring 3814 3826 15 1.3 762.8 0.3X
+Parsing without charset 4153 4184 32 1.2 830.7 0.3X
+Parsing with UTF-8 6014 6035 22 0.8 1202.9 0.2X
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Create a dataset of timestamps 163 164 2 6.1 162.6 1.0X
-to_json(timestamp) 1307 1383 92 0.8 1307.4 0.1X
-write timestamps to files 1044 1090 40 1.0 1044.5 0.2X
-Create a dataset of dates 195 207 10 5.1 195.2 0.8X
-to_json(date) 915 934 19 1.1 914.8 0.2X
-write dates to files 717 727 9 1.4 717.3 0.2X
+Create a dataset of timestamps 193 198 4 5.2 193.5 1.0X
+to_json(timestamp) 1566 1582 14 0.6 1566.4 0.1X
+write timestamps to files 1265 1274 14 0.8 1265.1 0.2X
+Create a dataset of dates 232 239 10 4.3 231.9 0.8X
+to_json(date) 1037 1058 18 1.0 1037.2 0.2X
+write dates to files 766 770 7 1.3 765.6 0.3X
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------------------------------
-read timestamp text from files 270 280 9 3.7 270.4 1.0X
-read timestamps from files 2623 2789 159 0.4 2623.1 0.1X
-infer timestamps from files 6416 7147 703 0.2 6415.7 0.0X
-read date text from files 233 234 1 4.3 233.3 1.2X
-read date from files 948 969 24 1.1 948.2 0.3X
-timestamp strings 335 347 14 3.0 334.9 0.8X
-parse timestamps from Dataset[String] 2961 2993 41 0.3 2960.6 0.1X
-infer timestamps from Dataset[String] 7139 7314 158 0.1 7139.1 0.0X
-date strings 384 397 15 2.6 383.6 0.7X
-parse dates from Dataset[String] 1325 1347 24 0.8 1325.0 0.2X
-from_json(timestamp) 4774 4788 13 0.2 4773.6 0.1X
-from_json(date) 3078 3090 11 0.3 3078.5 0.1X
-infer error timestamps from Dataset[String] with default format 2025 2058 28 0.5 2025.0 0.1X
-infer error timestamps from Dataset[String] with user-provided format 20261 20338 95 0.0 20260.6 0.0X
-infer error timestamps from Dataset[String] with legacy format 5495 5528 38 0.2 5495.4 0.0X
+read timestamp text from files 283 289 6 3.5 283.1 1.0X
+read timestamps from files 3364 3431 60 0.3 3363.6 0.1X
+infer timestamps from files 8913 8935 38 0.1 8912.6 0.0X
+read date text from files 263 267 4 3.8 262.9 1.1X
+read date from files 1102 1116 12 0.9 1101.7 0.3X
+timestamp strings 412 426 14 2.4 412.0 0.7X
+parse timestamps from Dataset[String] 3941 3956 14 0.3 3940.8 0.1X
+infer timestamps from Dataset[String] 9334 9383 43 0.1 9333.8 0.0X
+date strings 469 484 24 2.1 469.3 0.6X
+parse dates from Dataset[String] 1565 1572 11 0.6 1564.8 0.2X
+from_json(timestamp) 5825 5917 88 0.2 5824.5 0.0X
+from_json(date) 3553 3574 19 0.3 3553.1 0.1X
+infer error timestamps from Dataset[String] with default format 2590 2609 19 0.4 2589.9 0.1X
+infer error timestamps from Dataset[String] with user-provided format 2517 2551 30 0.4 2516.8 0.1X
Review Comment:
@MaxGekk The benchmark has been updated. `infer error timestamps from Dataset[String] with user-provided format` is already sped up.