Posted to reviews@spark.apache.org by "Hisoka-X (via GitHub)" <gi...@apache.org> on 2023/05/07 06:07:20 UTC

[GitHub] [spark] Hisoka-X opened a new pull request, #41078: [SPARK-39280][SQL] Fasten Timestamp type inference with user-provided format in JSON/CSV data source

Hisoka-X opened a new pull request, #41078:
URL: https://github.com/apache/spark/pull/41078

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   Follow-up to #36562: improve performance of Timestamp type inference with a user-provided format.
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   Improves performance of Timestamp type inference with a user-provided format.
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   Added a new test.
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192110202


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))

Review Comment:
   It will be checked in https://github.com/apache/spark/pull/41078/files/77c78b1f0ed3f7ee3c301eec2fa543a048c35531#diff-b42bcba727feeebf78f0e5540f2d4f6c6a38afd2225e4ebeae22a604e42eb094R183-R188, so the logic can't go wrong. In that case, though, the speed-up does not apply.





[GitHub] [spark] HyukjinKwon commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1537595353

   cc @sadikovi too




[GitHub] [spark] MaxGekk closed pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source
URL: https://github.com/apache/spark/pull/41078




[GitHub] [spark] MaxGekk commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192102884


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))

Review Comment:
   `parseUnresolved()` doesn't validate the correctness of its result; see its docs:
   ```
   The result of this method is TemporalAccessor which represents the data as seen in the input. Values are not validated, thus parsing a date string of '2012-00-65' would result in a temporal with three fields - year of '2012', month of '0' and day-of-month of '65'.
   ```
   
   I think you should check the result yourself somewhere.
   
   Could you re-check the example from the docs: '2012-00-65' and add a test, please.
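   The quoted javadoc behavior can be checked directly against the JDK. A minimal Java sketch (plain `java.time`, not Spark code): `parseUnresolved` accepts `'2012-00-65'` and hands back the raw, unvalidated field values, while a full `parse` rejects the same input during resolution.
   
   ```java
   import java.text.ParsePosition;
   import java.time.format.DateTimeFormatter;
   import java.time.format.DateTimeParseException;
   import java.time.temporal.ChronoField;
   import java.time.temporal.TemporalAccessor;
   
   public class ParseUnresolvedDemo {
       public static void main(String[] args) {
           DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd");
   
           // parseUnresolved only reads the text; it performs no validation or resolution,
           // so month 0 and day-of-month 65 are stored as-is
           TemporalAccessor raw = fmt.parseUnresolved("2012-00-65", new ParsePosition(0));
           System.out.println(raw.getLong(ChronoField.MONTH_OF_YEAR)); // 0
           System.out.println(raw.getLong(ChronoField.DAY_OF_MONTH));  // 65
   
           // a full parse() resolves and validates, so the same input throws
           try {
               fmt.parse("2012-00-65");
           } catch (DateTimeParseException e) {
               System.out.println("rejected during resolution");
           }
       }
   }
   ```
   
   This is why the validation has to happen at the later resolution step (`toZonedDateTime` / `toLocalDate`) rather than at tokenizing time.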





[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1546657983

   @MaxGekk I adjusted the code per your suggestion. By the way, maybe we should use Spotless to avoid problems like this.




[GitHub] [spark] MaxGekk commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1541990704

   The same question as for https://github.com/apache/spark/pull/41091:
   
   _Do the benchmarks `CSVBenchmark` and `JsonBenchmark` show any improvements? Could you regenerate the results `JsonBenchmark.*.txt` and `CSVBenchmark.*.txt`, please._




[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1546829555

   Thanks @MaxGekk for your patience and @srowen @HyukjinKwon.




[GitHub] [spark] srowen commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1537441000

   Can you fill out the JIRA, and add a little bit of explanation here?
   It seems like you just avoid trying to extract more fields if basic parsing fails?






[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192178990


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))

Review Comment:
   > Could you re-check the example from the docs: '2012-00-65' and add a test, please.
   
   Done





[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1545815468

   > @Hisoka-X Could you highlight in PR description how much does it become faster. Please, put some numbers to the section `Why are the changes needed?`.
   
   Done




[GitHub] [spark] MaxGekk commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192342220


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+        Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))

Review Comment:
   Could you move the common part:
   ```scala
   Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
   ```
   to `extractSeconds()`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+        Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractSeconds(parsed: TemporalAccessor): (Long, Long) = {
+    val parsedZoneId = parsed.query(TemporalQueries.zone())
+    val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+    val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+    val epochSeconds = zonedDateTime.toEpochSecond
+    val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+    (epochSeconds, microsOfSecond)
+  }

Review Comment:
   ```suggestion
     private def extractMicros(parsed: TemporalAccessor): Long = {
       val parsedZoneId = parsed.query(TemporalQueries.zone())
       val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
       val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
       val epochSeconds = zonedDateTime.toEpochSecond
       val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
       Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
     }
   ```
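   The suggested helper folds the seconds-to-micros conversion into `extractMicros`. As a standalone illustration of that arithmetic (a Java sketch, not Spark code; `MICROS_PER_SECOND` stands in for Spark's constant, 1,000,000):
   
   ```java
   import java.time.ZoneOffset;
   import java.time.ZonedDateTime;
   import java.time.temporal.ChronoField;
   
   public class EpochMicrosDemo {
       private static final long MICROS_PER_SECOND = 1_000_000L;
   
       // Same shape as the suggested extractMicros:
       // overflow-checked epochSeconds * 1_000_000 + microsOfSecond
       static long toEpochMicros(ZonedDateTime zdt) {
           long epochSeconds = zdt.toEpochSecond();
           long microsOfSecond = zdt.get(ChronoField.MICRO_OF_SECOND);
           return Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond);
       }
   
       public static void main(String[] args) {
           // 1970-01-01T00:00:01.5Z = 1.5 seconds after the epoch
           ZonedDateTime zdt = ZonedDateTime.of(1970, 1, 1, 0, 0, 1, 500_000_000, ZoneOffset.UTC);
           System.out.println(toEpochMicros(zdt)); // 1500000
       }
   }
   ```
   
   `Math.multiplyExact`/`Math.addExact` throw `ArithmeticException` on `long` overflow, which the surrounding `NonFatal` catch in `parseOptional` turns into `None`.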





[GitHub] [spark] MaxGekk commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1546828727

   +1, LGTM. Merging to master.
   Thank you, @Hisoka-X and @srowen @HyukjinKwon for review.




[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192986251


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,63 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        Some(extractMicros(parsed))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractMicros(parsed: TemporalAccessor): Long = {
+    val parsedZoneId = parsed.query(TemporalQueries.zone())
+    val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+    val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+    val epochSeconds = zonedDateTime.toEpochSecond
+    val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+    Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+  }
+
   override def parse(s: String): Long = {
     try {
       val parsed = formatter.parse(s)
-      val parsedZoneId = parsed.query(TemporalQueries.zone())
-      val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
-      val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
-      val epochSeconds = zonedDateTime.toEpochSecond
-      val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
-
-      Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+      extractMicros(parsed)
     } catch checkParsedDiff(s, legacyFormatter.parse)
   }
 
+  override def parseWithoutTimeZoneOptional(s: String, allowTimeZone: Boolean): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
+        Some(DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime)))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractDateAndTime(s: String, parsed: TemporalAccessor, allowTimeZone: Boolean):
+  (LocalDate, LocalTime) = {
+    if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
+      throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
+    }
+    val localDate = toLocalDate(parsed)
+    val localTime = toLocalTime(parsed)
+    (localDate, localTime)
+  }

Review Comment:
   Thanks for the reminder, I missed this.
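   The `TimestampNTZType` guard in the diff hinges on `TemporalQueries.zone()` returning null when the parsed text carries no zone or offset. A hedged Java sketch of that query (the patterns and inputs here are illustrative, not from the PR's tests):
   
   ```java
   import java.text.ParsePosition;
   import java.time.ZoneId;
   import java.time.format.DateTimeFormatter;
   import java.time.temporal.TemporalAccessor;
   import java.time.temporal.TemporalQueries;
   
   public class ZoneQueryDemo {
       public static void main(String[] args) {
           // No zone in the pattern: the query yields null, so NTZ parsing is allowed
           TemporalAccessor noZone = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
               .parseUnresolved("2023-05-07 06:07:20", new ParsePosition(0));
           System.out.println(noZone.query(TemporalQueries.zone())); // null
   
           // Zone present: the query is non-null, which extractDateAndTime turns
           // into an error when allowTimeZone is false
           TemporalAccessor withZone = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss VV")
               .parseUnresolved("2023-05-07 06:07:20 UTC", new ParsePosition(0));
           ZoneId zone = withZone.query(TemporalQueries.zone());
           System.out.println(zone); // UTC
       }
   }
   ```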





[GitHub] [spark] MaxGekk commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1545800041

   @Hisoka-X Could you highlight in PR description how much does it become faster. Please, put some numbers to the section `Why are the changes needed?`.




[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1186872738


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+        Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None

Review Comment:
   Yes, we want to avoid exceptions like `DateTimeParseException`, which are caused by field values that can't be parsed with the format. The catch here only lets fatal JVM errors like `VirtualMachineError` propagate, the same approach as in `DateTimeUtils.stringToTimestamp`.
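   The performance gain rests on `parseUnresolved` reporting tokenizing failure through the `ParsePosition` instead of by throwing, so the hot inference path never pays for exception construction on malformed rows. A minimal Java sketch of that failure mode (pattern and input are illustrative):
   
   ```java
   import java.text.ParsePosition;
   import java.time.format.DateTimeFormatter;
   import java.time.temporal.TemporalAccessor;
   
   public class NoThrowParseDemo {
       public static void main(String[] args) {
           DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss");
   
           ParsePosition pos = new ParsePosition(0);
           // Malformed input: no exception is thrown; the result is simply null
           // and the ParsePosition records where parsing failed
           TemporalAccessor parsed = fmt.parseUnresolved("not-a-timestamp", pos);
           System.out.println(parsed == null);           // true
           System.out.println(pos.getErrorIndex() >= 0); // true
       }
   }
   ```
   
   This mirrors the null check in `parseOptional`; the `NonFatal` catch then only covers the later resolution step, not the tokenizing.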





[GitHub] [spark] srowen commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1186870024


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+        Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None

Review Comment:
   You're saying you're avoiding exceptions, but then this catches exceptions. Is this just the fall-back case for unexpected errors?





[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Fasten Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1537302656

   cc @gengliangwang @HyukjinKwon @srowen 




[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192410333


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,66 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (epochSeconds, microsOfSecond) = extractSeconds(parsed)
+        Some(Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractSeconds(parsed: TemporalAccessor): (Long, Long) = {
+    val parsedZoneId = parsed.query(TemporalQueries.zone())
+    val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+    val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+    val epochSeconds = zonedDateTime.toEpochSecond
+    val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+    (epochSeconds, microsOfSecond)
+  }

Review Comment:
   Done





[GitHub] [spark] MaxGekk commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1192941467


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,63 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        Some(extractMicros(parsed))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractMicros(parsed: TemporalAccessor): Long = {
+    val parsedZoneId = parsed.query(TemporalQueries.zone())
+    val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+    val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+    val epochSeconds = zonedDateTime.toEpochSecond
+    val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+    Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+  }
+
   override def parse(s: String): Long = {
     try {
       val parsed = formatter.parse(s)
-      val parsedZoneId = parsed.query(TemporalQueries.zone())
-      val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
-      val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
-      val epochSeconds = zonedDateTime.toEpochSecond
-      val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
-
-      Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+      extractMicros(parsed)
     } catch checkParsedDiff(s, legacyFormatter.parse)
   }
 
+  override def parseWithoutTimeZoneOptional(s: String, allowTimeZone: Boolean): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
+        Some(DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime)))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractDateAndTime(s: String, parsed: TemporalAccessor, allowTimeZone: Boolean):
+  (LocalDate, LocalTime) = {
+    if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
+      throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
+    }
+    val localDate = toLocalDate(parsed)
+    val localTime = toLocalTime(parsed)
+    (localDate, localTime)
+  }

Review Comment:
   Let's deduplicate the code, and put one more common line here:
   ```suggestion
     private def extractMicrosNTZ(
         s: String,
         parsed: TemporalAccessor,
         allowTimeZone: Boolean): Long = {
       if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
         throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
       }
       val localDate = toLocalDate(parsed)
       val localTime = toLocalTime(parsed)
       DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime))
     }
   ```



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##########
@@ -472,4 +472,24 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite {
     assert(
       formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
   }
+
+  test("SPARK-39280: support returning optional parse results in the iso8601 formatter") {
+    val formatter = new Iso8601TimestampFormatter(
+      "yyyy-MM-dd HH:mm:ss.SSSS",
+      locale = DateFormatter.defaultLocale,
+      legacyFormat = LegacyDateFormats.SIMPLE_DATE_FORMAT,
+      isParsing = true, zoneId = DateTimeTestUtils.LA)
+    assert(formatter.parseOptional("9999-12-31 23:59:59.9990").contains(253402329599999000L))
+    assert(
+      formatter.parseWithoutTimeZoneOptional("9999-12-31 23:59:59.9990", false)
+        .contains(253402300799999000L))

Review Comment:
   ```suggestion
       assert(formatter.parseWithoutTimeZoneOptional("9999-12-31 23:59:59.9990", false)
         .contains(253402300799999000L))
   ```



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##########
@@ -472,4 +472,24 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite {
     assert(
       formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
   }
+
+  test("SPARK-39280: support returning optional parse results in the iso8601 formatter") {
+    val formatter = new Iso8601TimestampFormatter(
+      "yyyy-MM-dd HH:mm:ss.SSSS",
+      locale = DateFormatter.defaultLocale,
+      legacyFormat = LegacyDateFormats.SIMPLE_DATE_FORMAT,
+      isParsing = true, zoneId = DateTimeTestUtils.LA)
+    assert(formatter.parseOptional("9999-12-31 23:59:59.9990").contains(253402329599999000L))
+    assert(
+      formatter.parseWithoutTimeZoneOptional("9999-12-31 23:59:59.9990", false)
+        .contains(253402300799999000L))
+    assert(formatter.parseOptional("abc").isEmpty)
+    assert(
+      formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)

Review Comment:
   ```suggestion
       assert(formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,63 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        Some(extractMicros(parsed))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractMicros(parsed: TemporalAccessor): Long = {
+    val parsedZoneId = parsed.query(TemporalQueries.zone())
+    val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+    val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+    val epochSeconds = zonedDateTime.toEpochSecond
+    val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+    Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+  }
+
   override def parse(s: String): Long = {
     try {
       val parsed = formatter.parse(s)
-      val parsedZoneId = parsed.query(TemporalQueries.zone())
-      val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
-      val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
-      val epochSeconds = zonedDateTime.toEpochSecond
-      val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
-
-      Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+      extractMicros(parsed)
     } catch checkParsedDiff(s, legacyFormatter.parse)
   }
 
+  override def parseWithoutTimeZoneOptional(s: String, allowTimeZone: Boolean): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
+        Some(DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime)))

Review Comment:
   ```suggestion
           Some(extractMicrosNTZ(s, parsed, allowTimeZone))
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##########
@@ -163,27 +165,63 @@ class Iso8601TimestampFormatter(
   protected lazy val legacyFormatter = TimestampFormatter.getLegacyFormatter(
     pattern, zoneId, locale, legacyFormat)
 
+  override def parseOptional(s: String): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        Some(extractMicros(parsed))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractMicros(parsed: TemporalAccessor): Long = {
+    val parsedZoneId = parsed.query(TemporalQueries.zone())
+    val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
+    val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
+    val epochSeconds = zonedDateTime.toEpochSecond
+    val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
+    Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+  }
+
   override def parse(s: String): Long = {
     try {
       val parsed = formatter.parse(s)
-      val parsedZoneId = parsed.query(TemporalQueries.zone())
-      val timeZoneId = if (parsedZoneId == null) zoneId else parsedZoneId
-      val zonedDateTime = toZonedDateTime(parsed, timeZoneId)
-      val epochSeconds = zonedDateTime.toEpochSecond
-      val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)
-
-      Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
+      extractMicros(parsed)
     } catch checkParsedDiff(s, legacyFormatter.parse)
   }
 
+  override def parseWithoutTimeZoneOptional(s: String, allowTimeZone: Boolean): Option[Long] = {
+    try {
+      val parsed = formatter.parseUnresolved(s, new ParsePosition(0))
+      if (parsed != null) {
+        val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
+        Some(DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime)))
+      } else {
+        None
+      }
+    } catch {
+      case NonFatal(_) => None
+    }
+  }
+
+  private def extractDateAndTime(s: String, parsed: TemporalAccessor, allowTimeZone: Boolean):
+  (LocalDate, LocalTime) = {
+    if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
+      throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
+    }
+    val localDate = toLocalDate(parsed)
+    val localTime = toLocalTime(parsed)
+    (localDate, localTime)
+  }
+
   override def parseWithoutTimeZone(s: String, allowTimeZone: Boolean): Long = {
     try {
       val parsed = formatter.parse(s)
-      if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
-        throw QueryExecutionErrors.cannotParseStringAsDataTypeError(pattern, s, TimestampNTZType)
-      }
-      val localDate = toLocalDate(parsed)
-      val localTime = toLocalTime(parsed)
+      val (localDate, localTime) = extractDateAndTime(s, parsed, allowTimeZone)
       DateTimeUtils.localDateTimeToMicros(LocalDateTime.of(localDate, localTime))

Review Comment:
   Let's move the common line to `extractMicrosNTZ()`:
   ```scala
         extractMicrosNTZ(s, parsed, allowTimeZone)
   ```



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##########
@@ -472,4 +472,24 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite {
     assert(
       formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
   }
+
+  test("SPARK-39280: support returning optional parse results in the iso8601 formatter") {
+    val formatter = new Iso8601TimestampFormatter(
+      "yyyy-MM-dd HH:mm:ss.SSSS",
+      locale = DateFormatter.defaultLocale,
+      legacyFormat = LegacyDateFormats.SIMPLE_DATE_FORMAT,
+      isParsing = true, zoneId = DateTimeTestUtils.LA)
+    assert(formatter.parseOptional("9999-12-31 23:59:59.9990").contains(253402329599999000L))
+    assert(
+      formatter.parseWithoutTimeZoneOptional("9999-12-31 23:59:59.9990", false)
+        .contains(253402300799999000L))
+    assert(formatter.parseOptional("abc").isEmpty)
+    assert(
+      formatter.parseWithoutTimeZoneOptional("abc", false).isEmpty)
+
+    assert(formatter.parseOptional("2012-00-65 23:59:59.9990").isEmpty)
+    assert(
+      formatter.parseWithoutTimeZoneOptional("2012-00-65 23:59:59.9990", false)
+        .isEmpty)

Review Comment:
   ```suggestion
       assert(formatter.parseWithoutTimeZoneOptional("2012-00-65 23:59:59.9990", false).isEmpty)
   ```





[GitHub] [spark] Hisoka-X closed pull request #41078: [SPARK-39280][SQL] Fasten Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X closed pull request #41078: [SPARK-39280][SQL] Fasten Timestamp type inference with user-provided format in JSON/CSV data source
URL: https://github.com/apache/spark/pull/41078




[GitHub] [spark] Hisoka-X commented on pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on PR #41078:
URL: https://github.com/apache/spark/pull/41078#issuecomment-1537447586

   > Can you fill out the JIRA, and add a little bit of explanation here? 
   
   Done
   
   > It seems like you just avoid trying to extract more fields if basic parsing fails?
   
   Yes: a different parsing method, `parseUnresolved`, is used so the formatter does not throw an exception on invalid input, similar to the approach in https://github.com/apache/spark/pull/36562 .
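   For context, a minimal Java sketch (using the plain `java.time` API, not Spark's wrapper classes) of why `parseUnresolved` is cheaper on malformed input: it reports failure through the `ParsePosition` instead of throwing, so no exception has to be constructed and caught on the common miss path.
   
   ```java
   import java.text.ParsePosition;
   import java.time.format.DateTimeFormatter;
   import java.time.temporal.TemporalAccessor;
   
   public class ParseUnresolvedDemo {
       public static void main(String[] args) {
           DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
   
           // parseUnresolved() never throws on a mismatch: it returns null and
           // records the failure offset in the ParsePosition.
           ParsePosition bad = new ParsePosition(0);
           TemporalAccessor failed = fmt.parseUnresolved("not-a-timestamp", bad);
           System.out.println(failed == null);           // true
           System.out.println(bad.getErrorIndex() >= 0); // true
   
           // On matching text it returns an *unresolved* TemporalAccessor;
           // field resolution (e.g. rejecting month 00 or day 65) still happens
           // afterwards, which is why the PR keeps a try/catch around that step.
           ParsePosition ok = new ParsePosition(0);
           TemporalAccessor parsed = fmt.parseUnresolved("2023-05-07 06:07:20", ok);
           System.out.println(parsed != null);           // true
       }
   }
   ```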




[GitHub] [spark] Hisoka-X commented on a diff in pull request #41078: [SPARK-39280][SQL] Speed up Timestamp type inference with user-provided format in JSON/CSV data source

Posted by "Hisoka-X (via GitHub)" <gi...@apache.org>.
Hisoka-X commented on code in PR #41078:
URL: https://github.com/apache/spark/pull/41078#discussion_r1191261080


##########
sql/core/benchmarks/JsonBenchmark-results.txt:
##########
@@ -3,121 +3,121 @@ Benchmark for performance of JSON parsing
 ================================================================================================
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 JSON schema inferring:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        2973           3233         291          1.7         594.7       1.0X
-UTF-8 is set                                       4375           4796         430          1.1         874.9       0.7X
+No encoding                                        3871           3914          69          1.3         774.2       1.0X
+UTF-8 is set                                       5539           5563          26          0.9        1107.8       0.7X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 count a short column:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        2359           2404          39          2.1         471.8       1.0X
-UTF-8 is set                                       3814           3885         101          1.3         762.8       0.6X
+No encoding                                        2984           2999          24          1.7         596.9       1.0X
+UTF-8 is set                                       4875           4928          46          1.0         975.0       0.6X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 count a wide column:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                        4630           4969         347          0.2        4630.4       1.0X
-UTF-8 is set                                       8963           9040          82          0.1        8963.4       0.5X
+No encoding                                        6353           6446         143          0.2        6353.4       1.0X
+UTF-8 is set                                      10548          10647         163          0.1       10547.8       0.6X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 select wide row:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-No encoding                                       15252          15481         329          0.0      305030.9       1.0X
-UTF-8 is set                                      16349          16961         627          0.0      326988.8       0.9X
+No encoding                                       18807          18880          66          0.0      376130.9       1.0X
+UTF-8 is set                                      20530          20554          23          0.0      410593.2       0.9X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Select a subset of 10 columns:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 10 columns                                  2290           2296           6          0.4        2289.6       1.0X
-Select 1 column                                    1636           1652          15          0.6        1635.6       1.4X
+Select 10 columns                                  2741           2749          12          0.4        2740.6       1.0X
+Select 1 column                                    1916           1925           8          0.5        1916.5       1.4X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 creation of JSON parser per line:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Short column without encoding                       661            673          12          1.5         661.1       1.0X
-Short column with UTF-8                             950            978          26          1.1         950.1       0.7X
-Wide column without encoding                      11106          11297         179          0.1       11106.4       0.1X
-Wide column with UTF-8                            13743          13762          18          0.1       13743.3       0.0X
+Short column without encoding                       901            934          29          1.1         900.8       1.0X
+Short column with UTF-8                            1320           1343          31          0.8        1319.9       0.7X
+Wide column without encoding                      13446          13544         103          0.1       13445.8       0.1X
+Wide column with UTF-8                            17770          17854          76          0.1       17770.0       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           119            131          15          8.4         119.5       1.0X
-from_json                                          2475           2493          18          0.4        2474.9       0.0X
-json_tuple                                         2680           2745          57          0.4        2680.3       0.0X
-get_json_object                                    2549           2630          88          0.4        2549.3       0.0X
+Text read                                           159            167           9          6.3         159.2       1.0X
+from_json                                          2844           2863          25          0.4        2844.1       0.1X
+json_tuple                                         3137           3161          23          0.3        3136.7       0.1X
+get_json_object                                    2874           2884           9          0.3        2874.2       0.1X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Dataset of json strings:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           545            567          29          9.2         109.0       1.0X
-schema inferring                                   2460           2498          42          2.0         492.1       0.2X
-parsing                                            2618           2656          36          1.9         523.6       0.2X
+Text read                                           732            745          11          6.8         146.3       1.0X
+schema inferring                                   3260           3265           6          1.5         652.0       0.2X
+parsing                                            3592           3645          46          1.4         718.4       0.2X
 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Text read                                           884            897          16          5.7         176.8       1.0X
-Schema inferring                                   3016           3029          21          1.7         603.2       0.3X
-Parsing without charset                            3251           3267          14          1.5         650.2       0.3X
-Parsing with UTF-8                                 4892           5020         118          1.0         978.3       0.2X
+Text read                                          1092           1100          11          4.6         218.4       1.0X
+Schema inferring                                   3814           3826          15          1.3         762.8       0.3X
+Parsing without charset                            4153           4184          32          1.2         830.7       0.3X
+Parsing with UTF-8                                 6014           6035          22          0.8        1202.9       0.2X
 
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Create a dataset of timestamps                      163            164           2          6.1         162.6       1.0X
-to_json(timestamp)                                 1307           1383          92          0.8        1307.4       0.1X
-write timestamps to files                          1044           1090          40          1.0        1044.5       0.2X
-Create a dataset of dates                           195            207          10          5.1         195.2       0.8X
-to_json(date)                                       915            934          19          1.1         914.8       0.2X
-write dates to files                                717            727           9          1.4         717.3       0.2X
+Create a dataset of timestamps                      193            198           4          5.2         193.5       1.0X
+to_json(timestamp)                                 1566           1582          14          0.6        1566.4       0.1X
+write timestamps to files                          1265           1274          14          0.8        1265.1       0.2X
+Create a dataset of dates                           232            239          10          4.3         231.9       0.8X
+to_json(date)                                      1037           1058          18          1.0        1037.2       0.2X
+write dates to files                                766            770           7          1.3         765.6       0.3X
 
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1036-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Read dates and timestamps:                                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------------------------------------------------
-read timestamp text from files                                                   270            280           9          3.7         270.4       1.0X
-read timestamps from files                                                      2623           2789         159          0.4        2623.1       0.1X
-infer timestamps from files                                                     6416           7147         703          0.2        6415.7       0.0X
-read date text from files                                                        233            234           1          4.3         233.3       1.2X
-read date from files                                                             948            969          24          1.1         948.2       0.3X
-timestamp strings                                                                335            347          14          3.0         334.9       0.8X
-parse timestamps from Dataset[String]                                           2961           2993          41          0.3        2960.6       0.1X
-infer timestamps from Dataset[String]                                           7139           7314         158          0.1        7139.1       0.0X
-date strings                                                                     384            397          15          2.6         383.6       0.7X
-parse dates from Dataset[String]                                                1325           1347          24          0.8        1325.0       0.2X
-from_json(timestamp)                                                            4774           4788          13          0.2        4773.6       0.1X
-from_json(date)                                                                 3078           3090          11          0.3        3078.5       0.1X
-infer error timestamps from Dataset[String] with default format                 2025           2058          28          0.5        2025.0       0.1X
-infer error timestamps from Dataset[String] with user-provided format          20261          20338          95          0.0       20260.6       0.0X
-infer error timestamps from Dataset[String] with legacy format                  5495           5528          38          0.2        5495.4       0.0X
+read timestamp text from files                                                   283            289           6          3.5         283.1       1.0X
+read timestamps from files                                                      3364           3431          60          0.3        3363.6       0.1X
+infer timestamps from files                                                     8913           8935          38          0.1        8912.6       0.0X
+read date text from files                                                        263            267           4          3.8         262.9       1.1X
+read date from files                                                            1102           1116          12          0.9        1101.7       0.3X
+timestamp strings                                                                412            426          14          2.4         412.0       0.7X
+parse timestamps from Dataset[String]                                           3941           3956          14          0.3        3940.8       0.1X
+infer timestamps from Dataset[String]                                           9334           9383          43          0.1        9333.8       0.0X
+date strings                                                                     469            484          24          2.1         469.3       0.6X
+parse dates from Dataset[String]                                                1565           1572          11          0.6        1564.8       0.2X
+from_json(timestamp)                                                            5825           5917          88          0.2        5824.5       0.0X
+from_json(date)                                                                 3553           3574          19          0.3        3553.1       0.1X
+infer error timestamps from Dataset[String] with default format                 2590           2609          19          0.4        2589.9       0.1X
+infer error timestamps from Dataset[String] with user-provided format           2517           2551          30          0.4        2516.8       0.1X

Review Comment:
   @MaxGekk The benchmark has been updated. `infer error timestamps from Dataset[String] with user-provided format` is now much faster.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org