Posted to reviews@spark.apache.org by "Hisoka-X (via GitHub)" <gi...@apache.org> on 2023/05/07 06:07:20 UTC

[GitHub] [spark] Hisoka-X opened a new pull request, #41078: [SPARK-39280][SQL] Fasten Timestamp type inference with user-provided format in JSON/CSV data source

Hisoka-X opened a new pull request, #41078:
URL: https://github.com/apache/spark/pull/41078

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   Follow-up to #36562: improve performance of Timestamp type inference with a user-provided format.
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   Improves performance of Timestamp type inference with a user-provided format.
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   Added a new test.
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192110202


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))

Review Comment:
   It will be checked in https://github.com/apache/spark/pull/41078/files/77c78b1f0ed3f7ee3c301eec2fa543a048c35531#diff-b42bcba727feeebf78f0e5540f2d4f6c6a38afd2225e4ebeae22a604e42eb094R183-R188, so the logic can't go wrong. In that case, though, the speed-up does not apply.





[GitHub] [spark] HyukjinKwon commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1537595353

   cc @sadikovi too




[GitHub] [spark] MaxGekk closed pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
URL: https://github.com/apache/spark/pull/41078




[GitHub] [spark] MaxGekk commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192102884


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))

Review Comment:
   `parseUnresolved()` doesn't validate the correctness of its result; see its docs:
   ```
   The result of this method is TemporalAccessor which represents the data as seen in the input. Values are not validated, thus parsing a date string of '2012-00-65' would result in a temporal with three fields - year of '2012', month of '0' and day-of-month of '65'.
   ```
   
   I think you should check the result yourself somewhere.
   
   Could you re-check the example from the docs: '2012-00-65' and add a test, please.
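   The quoted javadoc behavior can be checked directly against the JDK. A minimal Java sketch (plain `java.time`, not Spark code): `parseUnresolved` accepts `'2012-00-65'` and hands back the raw, unvalidated field values, while a full `parse` rejects the same input during resolution.
   
   ```java
   import java.text.ParsePosition;
   import java.time.format.DateTimeFormatter;
   import java.time.format.DateTimeParseException;
   import java.time.temporal.ChronoField;
   import java.time.temporal.TemporalAccessor;
   
   public class ParseUnresolvedDemo {
       public static void main(String[] args) {
           DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd");
   
           // parseUnresolved only reads the text; it performs no validation or resolution,
           // so month 0 and day-of-month 65 are stored as-is
           TemporalAccessor raw = fmt.parseUnresolved("2012-00-65", new ParsePosition(0));
           System.out.println(raw.getLong(ChronoField.MONTH_OF_YEAR)); // 0
           System.out.println(raw.getLong(ChronoField.DAY_OF_MONTH));  // 65
   
           // a full parse() resolves and validates, so the same input throws
           try {
               fmt.parse("2012-00-65");
           } catch (DateTimeParseException e) {
               System.out.println("rejected during resolution");
           }
       }
   }
   ```
   
   This is why the validation has to happen at the later resolution step (`toZonedDateTime` / `toLocalDate`) rather than at tokenizing time.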





[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1546657983

   @MaxGekk I adjusted the code per your suggestion. By the way, maybe we should use Spotless to avoid problems like this.




[GitHub] [spark] MaxGekk commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1541990704

   The same question as for https://github.com/apache/spark/pull/41091:
   
   _Do the benchmarks `CSVBenchmark` and `JsonBenchmark` show any improvements? Could you regenerate the results `JsonBenchmark.*.txt` and `CSVBenchmark.*.txt`, please._




[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1546829555

   Thanks @MaxGekk for your patience and @srowen @HyukjinKwon.




[GitHub] [spark] srowen commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1537441000

   Can you fill out the JIRA, and add a little bit of explanation here?
   It seems like you just avoid trying to extract more fields if basic parsing fails?






[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192178990


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))

Review Comment:
   > Could you re-check the example from the docs: '2012-00-65' and add a test, please.
   
   Done





[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1545815468

   > @Hisoka-X Could you highlight in PR description how much does it become faster. Please, put some numbers to the section `Why are the changes needed?`.
   
   Done




[GitHub] [spark] MaxGekk commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192342220


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+        Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))

Review Comment:
   Could you move the common part:
   ```scala
   Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
   ```
   to `extractSeconds()`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+        Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractSeconds(parsed: TemporalAccessor): (Long, Long) = {
+    val parsedZoneId = parsed.query(TemporalQueries.zone())
+    val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+    val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+    val epochSeconds = zonedDateTime.toEpochSecond
+    val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+    (epochSeconds, microsOfSecond)
+  }

Review Comment:
   ```suggestion
     private def extractMicros(parsed: TemporalAccessor): Long = {
       val parsedZoneId = parsed.query(TemporalQueries.zone())
       val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
       val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
       val epochSeconds = zonedDateTime.toEpochSecond
       val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
       Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
     }
   ```
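   The suggested helper folds the seconds-to-micros conversion into `extractMicros`. As a standalone illustration of that arithmetic (a Java sketch, not Spark code; `MICROS_PER_SECOND` stands in for Spark's constant, 1,000,000):
   
   ```java
   import java.time.ZoneOffset;
   import java.time.ZonedDateTime;
   import java.time.temporal.ChronoField;
   
   public class EpochMicrosDemo {
       private static final long MICROS_PER_SECOND = 1_000_000L;
   
       // Same shape as the suggested extractMicros:
       // overflow-checked epochSeconds * 1_000_000 + microsOfSecond
       static long toEpochMicros(ZonedDateTime zdt) {
           long epochSeconds = zdt.toEpochSecond();
           long microsOfSecond = zdt.get(ChronoField.MICRO_OF_SECOND);
           return Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond);
       }
   
       public static void main(String[] args) {
           // 1970-01-01T00:00:01.5Z = 1.5 seconds after the epoch
           ZonedDateTime zdt = ZonedDateTime.of(1970, 1, 1, 0, 0, 1, 500_000_000, ZoneOffset.UTC);
           System.out.println(toEpochMicros(zdt)); // 1500000
       }
   }
   ```
   
   `Math.multiplyExact`/`Math.addExact` throw `ArithmeticException` on `long` overflow, which the surrounding `NonFatal` catch in `parseOptional` turns into `None`.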





[GitHub] [spark] MaxGekk commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1546828727

   +1, LGTM. Merging to master.
   Thank you, @Hisoka-X and @srowen @HyukjinKwon for review.




[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192986251


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,63 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        Some(extractMicros(parsed))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractMicros(parsed: TemporalAccessor): Long = {
+    val parsedZoneId = parsed.query(TemporalQueries.zone())
+    val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+    val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+    val epochSeconds = zonedDateTime.toEpochSecond
+    val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+    Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+  }
+
   override def parse(s: String): Long = {
     try {
       val parsed = formatter.parse(s)
-      val parsedZoneId = parsed.query(TemporalQueries.zone())
-      val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
-      val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
-      val epochSeconds = zonedDateTime.toEpochSecond
-      val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
-
-      Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+      extractMicros(parsed)
     } catch checkParsedDiff(s, legacyFormatter.parse)
   }
 
+  override def parseWithoutTimeZoneOptional(s: String, allowTimeZone: Boolean): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
+        Some(DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime)))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractDateAndTime(s: String, parsed: TemporalAccessor, allowTimeZone: Boolean):
+  (LocalDate, LocalTime) = {
+    if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
+      throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
+    }
+    val localDate = toLocalDate(parsed)
+    val localTime = toLocalTime(parsed)
+    (localDate, localTime)
+  }

Review Comment:
   Thanks for the reminder, I missed this.
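   The `TimestampNTZType` guard in the diff hinges on `TemporalQueries.zone()` returning null when the parsed text carries no zone or offset. A hedged Java sketch of that query (the patterns and inputs here are illustrative, not from the PR's tests):
   
   ```java
   import java.text.ParsePosition;
   import java.time.ZoneId;
   import java.time.format.DateTimeFormatter;
   import java.time.temporal.TemporalAccessor;
   import java.time.temporal.TemporalQueries;
   
   public class ZoneQueryDemo {
       public static void main(String[] args) {
           // No zone in the pattern: the query yields null, so NTZ parsing is allowed
           TemporalAccessor noZone = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
               .parseUnresolved("2023-05-07 06:07:20", new ParsePosition(0));
           System.out.println(noZone.query(TemporalQueries.zone())); // null
   
           // Zone present: the query is non-null, which extractDateAndTime turns
           // into an error when allowTimeZone is false
           TemporalAccessor withZone = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss VV")
               .parseUnresolved("2023-05-07 06:07:20 UTC", new ParsePosition(0));
           ZoneId zone = withZone.query(TemporalQueries.zone());
           System.out.println(zone); // UTC
       }
   }
   ```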





[GitHub] [spark] MaxGekk commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1545800041

   @Hisoka-X Could you highlight in PR description how much does it become faster. Please, put some numbers to the section `Why are the changes needed?`.




[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1186872738


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+        Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None

Review Comment:
   Yes, we want to avoid exceptions like `DateTimeParseException`, which are caused by field values that can't be parsed with the format. The catch here only lets fatal JVM errors like `VirtualMachineError` propagate, the same approach as in `DateTimeUtils.stringToTimestamp`.
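   The performance gain rests on `parseUnresolved` reporting tokenizing failure through the `ParsePosition` instead of by throwing, so the hot inference path never pays for exception construction on malformed rows. A minimal Java sketch of that failure mode (pattern and input are illustrative):
   
   ```java
   import java.text.ParsePosition;
   import java.time.format.DateTimeFormatter;
   import java.time.temporal.TemporalAccessor;
   
   public class NoThrowParseDemo {
       public static void main(String[] args) {
           DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss");
   
           ParsePosition pos = new ParsePosition(0);
           // Malformed input: no exception is thrown; the result is simply null
           // and the ParsePosition records where parsing failed
           TemporalAccessor parsed = fmt.parseUnresolved("not-a-timestamp", pos);
           System.out.println(parsed == null);           // true
           System.out.println(pos.getErrorIndex() >= 0); // true
       }
   }
   ```
   
   This mirrors the null check in `parseOptional`; the `NonFatal` catch then only covers the later resolution step, not the tokenizing.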





[GitHub] [spark] srowen commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1186870024


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+        Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None

Review Comment:
   You're saying you're avoiding exceptions, but then this catches exceptions. Is this just the fall-back case for unexpected errors?





[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Fasten Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1537302656

   cc @gengliangwang @HyukjinKwon @srowen 




[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192410333


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+        Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractSeconds(parsed: TemporalAccessor): (Long, Long) = {
+    val parsedZoneId = parsed.query(TemporalQueries.zone())
+    val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+    val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+    val epochSeconds = zonedDateTime.toEpochSecond
+    val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+    (epochSeconds, microsOfSecond)
+  }

Review Comment:
   Done





[GitHub] [spark] MaxGekk commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192941467


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,63 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        Some(extractMicros(parsed))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractMicros(parsed: TemporalAccessor): Long = {
+    val parsedZoneId = parsed.query(TemporalQueries.zone())
+    val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+    val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+    val epochSeconds = zonedDateTime.toEpochSecond
+    val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+    Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+  }
+
   override def parse(s: String): Long = {
     try {
       val parsed = formatter.parse(s)
-      val parsedZoneId = parsed.query(TemporalQueries.zone())
-      val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
-      val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
-      val epochSeconds = zonedDateTime.toEpochSecond
-      val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
-
-      Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+      extractMicros(parsed)
     } catch checkParsedDiff(s, legacyFormatter.parse)
   }
 
+  override def parseWithoutTimeZoneOptional(s: String, allowTimeZone: Boolean): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
+        Some(DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime)))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractDateAndTime(s: String, parsed: TemporalAccessor, allowTimeZone: Boolean):
+  (LocalDate, LocalTime) = {
+    if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
+      throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
+    }
+    val localDate = toLocalDate(parsed)
+    val localTime = toLocalTime(parsed)
+    (localDate, localTime)
+  }

Review Comment:
   Let's deduplicate the code, and put one more common line here:
   ```suggestion
     private def extractMicrosNTZ(
         s: String,
         parsed: TemporalAccessor,
         allowTimeZone: Boolean): Long = {
       if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
         throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
       }
       val localDate = toLocalDate(parsed)
       val localTime = toLocalTime(parsed)
       DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime))
     }
   ```



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##########
@@ -472,4 +472,24 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite {
     assert(
       formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
   }
+
+  test("SPARK-39280: support returning optional parse results in the iso8601 formatter") {
+    val formatter = new Iso8601TimestampFormatter(
+      "yyyy-MM-dd HH:mm:ss.SSSS",
+      locale = DateFormatter.defaultLocale,
+      legacyFormat = LegacyDateFormats.SIMPLE_DATE_FORMAT,
+      isParsing = true, zoneId = DateTimeTestUtils.LA)
+    assert(formatter.parseOptional("9999-12-31 23:59:59.9990").contains(253402329599999000L))
+    assert(
+      formatter.parseWithoutTimeZoneOptional("9999-12-31 23:59:59.9990", false)
+        .contains(253402300799999000L))

Review Comment:
   ```suggestion
       assert(formatter.parseWithoutTimeZoneOptional("9999-12-31 23:59:59.9990", false)
         .contains(253402300799999000L))
   ```



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##########
@@ -472,4 +472,24 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite {
     assert(
       formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
   }
+
+  test("SPARK-39280: support returning optional parse results in the iso8601 formatter") {
+    val formatter = new Iso8601TimestampFormatter(
+      "yyyy-MM-dd HH:mm:ss.SSSS",
+      locale = DateFormatter.defaultLocale,
+      legacyFormat = LegacyDateFormats.SIMPLE_DATE_FORMAT,
+      isParsing = true, zoneId = DateTimeTestUtils.LA)
+    assert(formatter.parseOptional("9999-12-31 23:59:59.9990").contains(253402329599999000L))
+    assert(
+      formatter.parseWithoutTimeZoneOptional("9999-12-31 23:59:59.9990", false)
+        .contains(253402300799999000L))
+    assert(formatter.parseOptional("abc").isEmpty)
+    assert(
+      formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)

Review Comment:
   ```suggestion
       assert(formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,63 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        Some(extractMicros(parsed))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractMicros(parsed: TemporalAccessor): Long = {
+    val parsedZoneId = parsed.query(TemporalQueries.zone())
+    val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+    val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+    val epochSeconds = zonedDateTime.toEpochSecond
+    val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+    Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+  }
+
   override def parse(s: String): Long = {
     try {
       val parsed = formatter.parse(s)
-      val parsedZoneId = parsed.query(TemporalQueries.zone())
-      val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
-      val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
-      val epochSeconds = zonedDateTime.toEpochSecond
-      val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
-
-      Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+      extractMicros(parsed)
     } catch checkParsedDiff(s, legacyFormatter.parse)
   }
 
+  override def parseWithoutTimeZoneOptional(s: String, allowTimeZone: Boolean): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
+        Some(DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime)))

Review Comment:
   ```suggestion
           Some(extractMicrosNTZ(s, parsed, allowTimeZone))
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,63 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        Some(extractMicros(parsed))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractMicros(parsed: TemporalAccessor): Long = {
+    val parsedZoneId = parsed.query(TemporalQueries.zone())
+    val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+    val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+    val epochSeconds = zonedDateTime.toEpochSecond
+    val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+    Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+  }
+
   override def parse(s: String): Long = {
     try {
       val parsed = formatter.parse(s)
-      val parsedZoneId = parsed.query(TemporalQueries.zone())
-      val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
-      val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
-      val epochSeconds = zonedDateTime.toEpochSecond
-      val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
-
-      Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+      extractMicros(parsed)
     } catch checkParsedDiff(s, legacyFormatter.parse)
   }
 
+  override def parseWithoutTimeZoneOptional(s: String, allowTimeZone: Boolean): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
+        Some(DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime)))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractDateAndTime(s: String, parsed: TemporalAccessor, allowTimeZone: Boolean):
+  (LocalDate, LocalTime) = {
+    if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
+      throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
+    }
+    val localDate = toLocalDate(parsed)
+    val localTime = toLocalTime(parsed)
+    (localDate, localTime)
+  }
+
   override def parseWithoutTimeZone(s: String, allowTimeZone: Boolean): Long = {
     try {
       val parsed = formatter.parse(s)
-      if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
-        throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
-      }
-      val localDate = toLocalDate(parsed)
-      val localTime = toLocalTime(parsed)
+      val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
       DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime))

Review Comment:
   Let's move the common line to `extractMicrosNTZ()`:
   ```scala
         extractMicrosNTZ(s, parsed, allowTimeZone)
   ```



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##########
@@ -472,4 +472,24 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite {
     assert(
       formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
   }
+
+  test("SPARK-39280: support returning optional parse results in the iso8601 formatter") {
+    val formatter = new Iso8601TimestampFormatter(
+      "yyyy-MM-dd HH:mm:ss.SSSS",
+      locale = DateFormatter.defaultLocale,
+      legacyFormat = LegacyDateFormats.SIMPLE_DATE_FORMAT,
+      isParsing = true, zoneId = DateTimeTestUtils.LA)
+    assert(formatter.parseOptional("9999-12-31 23:59:59.9990").contains(253402329599999000L))
+    assert(
+      formatter.parseWithoutTimeZoneOptional("9999-12-31 23:59:59.9990", false)
+        .contains(253402300799999000L))
+    assert(formatter.parseOptional("abc").isEmpty)
+    assert(
+      formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
+
+    assert(formatter.parseOptional("2012-00-65 23:59:59.9990").isEmpty)
+    assert(
+      formatter.parseWithoutTimeZoneOptional("2012-00-65 23:59:59.9990", false)
+        .isEmpty)

Review Comment:
   ```suggestion
       assert(formatter.parseWithoutTimeZoneOptional("2012-00-65 23:59:59.9990", false).isEmpty)
   ```





[GitHub] [spark] Hisoka-X closed pull request #41078: [SPARK-39280][SQL] Fasten Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X closed pull request #41078: [SPARK-39280][SQL] Fasten Timestamp type inference with user-provided format in JSON/CSV data source
URL: https://github.com/apache/spark/pull/41078




[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1537447586

   > Can you fill out the JIRA, and add a little bit of explanation here? 
   
   Done
   
   > It seems like you just avoid trying to extract more fields if basic parsing fails?
   
   Yes: a different parsing method, `parseUnresolved`, is used so the formatter does not throw an exception on invalid input, similar to the approach in https://github.com/apache/spark/pull/36562 .
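   For context, a minimal Java sketch (using the plain `java.time` API, not Spark's wrapper classes) of why `parseUnresolved` is cheaper on malformed input: it reports failure through the `ParsePosition` instead of throwing, so no exception has to be constructed and caught on the common miss path.
   
   ```java
   import java.text.ParsePosition;
   import java.time.format.DateTimeFormatter;
   import java.time.temporal.TemporalAccessor;
   
   public class ParseUnresolvedDemo {
       public static void main(String[] args) {
           DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
   
           // parseUnresolved() never throws on a mismatch: it returns null and
           // records the failure offset in the ParsePosition.
           ParsePosition bad = new ParsePosition(0);
           TemporalAccessor failed = fmt.parseUnresolved("not-a-timestamp", bad);
           System.out.println(failed == null);           // true
           System.out.println(bad.getErrorIndex() >= 0); // true
   
           // On matching text it returns an *unresolved* TemporalAccessor;
           // field resolution (e.g. rejecting month 00 or day 65) still happens
           // afterwards, which is why the PR keeps a try/catch around that step.
           ParsePosition ok = new ParsePosition(0);
           TemporalAccessor parsed = fmt.parseUnresolved("2023-05-07 06:07:20", ok);
           System.out.println(parsed != null);           // true
       }
   }
   ```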




[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1191261080


##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
@@ -3,121 +3,121 @@ Benchmark for performance of JSON parsing
 ================================================================================================
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 JSON schema inferring:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        2973           3233         291          1.7         594.7       1.0X
-UTF-8 is set                                       4375           4796         430          1.1         874.9       0.7X
+No encoding                                        3871           3914          69          1.3         774.2       1.0X
+UTF-8 is set                                       5539           5563          26          0.9        1107.8       0.7X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 count a short column:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        2359           2404          39          2.1         471.8       1.0X
-UTF-8 is set                                       3814           3885         101          1.3         762.8       0.6X
+No encoding                                        2984           2999          24          1.7         596.9       1.0X
+UTF-8 is set                                       4875           4928          46          1.0         975.0       0.6X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 count a wide column:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        4630           4969         347          0.2        4630.4       1.0X
-UTF-8 is set                                       8963           9040          82          0.1        8963.4       0.5X
+No encoding                                        6353           6446         143          0.2        6353.4       1.0X
+UTF-8 is set                                      10548          10647         163          0.1       10547.8       0.6X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 select wide row:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                       15252          15481         329          0.0      305030.9       1.0X
-UTF-8 is set                                      16349          16961         627          0.0      326988.8       0.9X
+No encoding                                       18807          18880          66          0.0      376130.9       1.0X
+UTF-8 is set                                      20530          20554          23          0.0      410593.2       0.9X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Select a subset of 10 columns:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 10 columns                                  2290           2296           6          0.4        2289.6       1.0X
-Select 1 column                                    1636           1652          15          0.6        1635.6       1.4X
+Select 10 columns                                  2741           2749          12          0.4        2740.6       1.0X
+Select 1 column                                    1916           1925           8          0.5        1916.5       1.4X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 creation of JSON parser per line:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Short column without encoding                       661            673          12          1.5         661.1       1.0X
-Short column with UTF-8                             950            978          26          1.1         950.1       0.7X
-Wide column without encoding                      11106          11297         179          0.1       11106.4       0.1X
-Wide column with UTF-8                            13743          13762          18          0.1       13743.3       0.0X
+Short column without encoding                       901            934          29          1.1         900.8       1.0X
+Short column with UTF-8                            1320           1343          31          0.8        1319.9       0.7X
+Wide column without encoding                      13446          13544         103          0.1       13445.8       0.1X
+Wide column with UTF-8                            17770          17854          76          0.1       17770.0       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           119            131          15          8.4         119.5       1.0X
-from_json                                          2475           2493          18          0.4        2474.9       0.0X
-json_tuple                                         2680           2745          57          0.4        2680.3       0.0X
-get_json_object                                    2549           2630          88          0.4        2549.3       0.0X
+Text read                                           159            167           9          6.3         159.2       1.0X
+from_json                                          2844           2863          25          0.4        2844.1       0.1X
+json_tuple                                         3137           3161          23          0.3        3136.7       0.1X
+get_json_object                                    2874           2884           9          0.3        2874.2       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Dataset of json strings:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           545            567          29          9.2         109.0       1.0X
-schema inferring                                   2460           2498          42          2.0         492.1       0.2X
-parsing                                            2618           2656          36          1.9         523.6       0.2X
+Text read                                           732            745          11          6.8         146.3       1.0X
+schema inferring                                   3260           3265           6          1.5         652.0       0.2X
+parsing                                            3592           3645          46          1.4         718.4       0.2X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           884            897          16          5.7         176.8       1.0X
-Schema inferring                                   3016           3029          21          1.7         603.2       0.3X
-Parsing without charset                            3251           3267          14          1.5         650.2       0.3X
-Parsing with UTF-8                                 4892           5020         118          1.0         978.3       0.2X
+Text read                                          1092           1100          11          4.6         218.4       1.0X
+Schema inferring                                   3814           3826          15          1.3         762.8       0.3X
+Parsing without charset                            4153           4184          32          1.2         830.7       0.3X
+Parsing with UTF-8                                 6014           6035          22          0.8        1202.9       0.2X
 
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Create a dataset of timestamps                      163            164           2          6.1         162.6       1.0X
-to_json(timestamp)                                 1307           1383          92          0.8        1307.4       0.1X
-write timestamps to files                          1044           1090          40          1.0        1044.5       0.2X
-Create a dataset of dates                           195            207          10          5.1         195.2       0.8X
-to_json(date)                                       915            934          19          1.1         914.8       0.2X
-write dates to files                                717            727           9          1.4         717.3       0.2X
+Create a dataset of timestamps                      193            198           4          5.2         193.5       1.0X
+to_json(timestamp)                                 1566           1582          14          0.6        1566.4       0.1X
+write timestamps to files                          1265           1274          14          0.8        1265.1       0.2X
+Create a dataset of dates                           232            239          10          4.3         231.9       0.8X
+to_json(date)                                      1037           1058          18          1.0        1037.2       0.2X
+write dates to files                                766            770           7          1.3         765.6       0.3X
 
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Read dates and timestamps:                                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------------------------------------------------
-read timestamp text from files                                                   270            280           9          3.7         270.4       1.0X
-read timestamps from files                                                      2623           2789         159          0.4        2623.1       0.1X
-infer timestamps from files                                                     6416           7147         703          0.2        6415.7       0.0X
-read date text from files                                                        233            234           1          4.3         233.3       1.2X
-read date from files                                                             948            969          24          1.1         948.2       0.3X
-timestamp strings                                                                335            347          14          3.0         334.9       0.8X
-parse timestamps from Dataset[String]                                           2961           2993          41          0.3        2960.6       0.1X
-infer timestamps from Dataset[String]                                           7139           7314         158          0.1        7139.1       0.0X
-date strings                                                                     384            397          15          2.6         383.6       0.7X
-parse dates from Dataset[String]                                                1325           1347          24          0.8        1325.0       0.2X
-from_json(timestamp)                                                            4774           4788          13          0.2        4773.6       0.1X
-from_json(date)                                                                 3078           3090          11          0.3        3078.5       0.1X
-infer error timestamps from Dataset[String] with default format                 2025           2058          28          0.5        2025.0       0.1X
-infer error timestamps from Dataset[String] with user-provided format          20261          20338          95          0.0       20260.6       0.0X
-infer error timestamps from Dataset[String] with legacy format                  5495           5528          38          0.2        5495.4       0.0X
+read timestamp text from files                                                   283            289           6          3.5         283.1       1.0X
+read timestamps from files                                                      3364           3431          60          0.3        3363.6       0.1X
+infer timestamps from files                                                     8913           8935          38          0.1        8912.6       0.0X
+read date text from files                                                        263            267           4          3.8         262.9       1.1X
+read date from files                                                            1102           1116          12          0.9        1101.7       0.3X
+timestamp strings                                                                412            426          14          2.4         412.0       0.7X
+parse timestamps from Dataset[String]                                           3941           3956          14          0.3        3940.8       0.1X
+infer timestamps from Dataset[String]                                           9334           9383          43          0.1        9333.8       0.0X
+date strings                                                                     469            484          24          2.1         469.3       0.6X
+parse dates from Dataset[String]                                                1565           1572          11          0.6        1564.8       0.2X
+from_json(timestamp)                                                            5825           5917          88          0.2        5824.5       0.0X
+from_json(date)                                                                 3553           3574          19          0.3        3553.1       0.1X
+infer error timestamps from Dataset[String] with default format                 2590           2609          19          0.4        2589.9       0.1X
+infer error timestamps from Dataset[String] with user-provided format           2517           2551          30          0.4        2516.8       0.1X

Review Comment:
   @MaxGekk The benchmark has been updated. `infer error timestamps from Dataset[String] with user-provided format` is now much faster.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org