You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ssonker <gi...@git.apache.org> on 2018/06/07 08:56:07 UTC
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
GitHub user ssonker opened a pull request:
https://github.com/apache/spark/pull/21505
[SPARK-24457][SQL] Improving performance of stringToTimestamp by cach…
…ing Calendar instances for input timezones instead of creating new everytime
## What changes were proposed in this pull request?
As of now, stringToTimestamp function in DateTimeUtils creates a calendar instance on each call. This change maintains a thread-local timezone to calendar map, and creates just one calendar for each timezone. Whenever a calendar instance is queried given a timezone, it is looked-up inside the map, reinitialized and returned.
## How was this patch tested?
Using existing test cases.
Please review http://spark.apache.org/contributing.html before opening a pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ssonker/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21505.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21505
----
commit 84d5a911408411f327b620bb958b996e55264781
Author: Sharad Sonker <ss...@...>
Date: 2018-06-07T04:56:37Z
[SPARK-24457][SQL] Improving performance of stringToTimestamp by caching Calendar instances for input timezones instead of creating new everytime
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on the issue:
https://github.com/apache/spark/pull/21505
I need to update this PR with more changes, will open a new one! Closing this.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193679978
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+ override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
+ mutable.Map[TimeZone, (Calendar, Long)]()
+ }
+ }
+
+ def getCalendar(timeZone: TimeZone): Calendar = {
+ val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+ .getOrElseUpdate(timeZone, {
+ val c = Calendar.getInstance(timeZone)
+ (c, c.getTimeInMillis)
+ })
+ c.clear()
+ c.setTimeInMillis(timeInMillis)
--- End diff --
@kiszk Thanks, I'm updating that. BTW, can you please help me understand a scenario where that is absolutely needed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r194365155
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --
+1 for map approach
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193688565
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
+ override def initialValue(): mutable.Map[TimeZone, Calendar] = {
+ mutable.Map[TimeZone, Calendar]()
+ }
+ }
+
+ def getCalendar(timeZone: TimeZone): Calendar = {
+ val c = threadLocalComputedCalendarsMap.get()
+ .getOrElseUpdate(timeZone, {
+ Calendar.getInstance(timeZone)
+ })
+ c.clear()
--- End diff --
Doesn't `clear` reset the timezone of that `Calendar` instance?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r194294961
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -125,7 +125,6 @@ object DateTimeUtils {
.getOrElseUpdate(timeZone, {
Calendar.getInstance(timeZone)
})
- c.clear()
--- End diff --
Seems `setTimeInMillis` can result in all fields set. If so, `clear` is redundant.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker closed the pull request at:
https://github.com/apache/spark/pull/21505
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r194299485
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --
Yes, it should work functionally if we check a given time zone every time.
Do you know the typical access pattern of time zone? If there is temporal locality regarding time zone, we do not have to use `mutale.Map`.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21505
ping @ssonker to update or close.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21505
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21505
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r194294182
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -125,7 +125,6 @@ object DateTimeUtils {
.getOrElseUpdate(timeZone, {
Calendar.getInstance(timeZone)
})
- c.clear()
--- End diff --
@viirya @kiszk Do you agree with this commit?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21505
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on the issue:
https://github.com/apache/spark/pull/21505
@kiszk @viirya Do you have more review comments that need to be incorporated? If not, can you please get this merged?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r194292883
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --
@kiszk I think @viirya meant having just one thread-local calendar instance. That should also work, isn't it?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r202548671
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/StringToTimestampBenchmark.scala ---
@@ -0,0 +1,53 @@
+/*
--- End diff --
Does this really need to go in the code base?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/21505
LGTM
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193676953
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+ override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
+ mutable.Map[TimeZone, (Calendar, Long)]()
+ }
+ }
+
+ def getCalendar(timeZone: TimeZone): Calendar = {
+ val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+ .getOrElseUpdate(timeZone, {
+ val c = Calendar.getInstance(timeZone)
+ (c, c.getTimeInMillis)
--- End diff --
Yes, correct.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on the issue:
https://github.com/apache/spark/pull/21505
@viirya Done.
cc: @cloud-fan
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r194847824
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/StringToTimestampBenchmark.scala ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.util.Calendar
+
+import org.apache.spark.sql.catalyst.util.{DateTimeTestUtils, DateTimeUtils}
+import org.apache.spark.util.Benchmark
+
+object StringToTimestampBenchmark {
--- End diff --
This benchmark tests `Calendar` creation, not string to timestamp. how abuot `DateTimeUtilsBenchmark`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21505
cc @cloud-fan
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/21505
We would appreciate it if you put the performance before and after this PR?
It would be good to use `Benchmark` class.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193694372
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
+ override def initialValue(): mutable.Map[TimeZone, Calendar] = {
+ mutable.Map[TimeZone, Calendar]()
+ }
+ }
+
+ def getCalendar(timeZone: TimeZone): Calendar = {
+ val c = threadLocalComputedCalendarsMap.get()
+ .getOrElseUpdate(timeZone, {
+ Calendar.getInstance(timeZone)
+ })
+ c.clear()
--- End diff --
Nope. It clears all the ```fields``` and zone is not a field.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21505
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92969/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r194309823
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --
How much performance down without a map here?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21505
**[Test build #92969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92969/testReport)** for PR 21505 at commit [`c940381`](https://github.com/apache/spark/commit/c940381a0be36fd227e8f63caf32d3be86c5aa69).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193676413
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+ override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
+ mutable.Map[TimeZone, (Calendar, Long)]()
+ }
+ }
+
+ def getCalendar(timeZone: TimeZone): Calendar = {
+ val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+ .getOrElseUpdate(timeZone, {
+ val c = Calendar.getInstance(timeZone)
+ (c, c.getTimeInMillis)
--- End diff --
Isn't `timeInMillis` also stored when you first update this map entity? So next time you access this calendar, you just set it with the old `timeInMillis`. Isn't it?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193674578
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+ override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
+ mutable.Map[TimeZone, (Calendar, Long)]()
+ }
+ }
+
+ def getCalendar(timeZone: TimeZone): Calendar = {
+ val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+ .getOrElseUpdate(timeZone, {
+ val c = Calendar.getInstance(timeZone)
+ (c, c.getTimeInMillis)
--- End diff --
When you get the calendar instance next time, isn't its time out of date?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21505
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r194319491
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --
I ran a benchmark and following is the output:
```string to timestamp calendar caching: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
with map 8 / 9 12.9 77.7 1.0X
without map 11 / 12 9.4 106.3 0.7X```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193675674
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+ override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
+ mutable.Map[TimeZone, (Calendar, Long)]()
+ }
+ }
+
+ def getCalendar(timeZone: TimeZone): Calendar = {
+ val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+ .getOrElseUpdate(timeZone, {
+ val c = Calendar.getInstance(timeZone)
+ (c, c.getTimeInMillis)
--- End diff --
@viirya ^ Does that answer you question, or you mean something else?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193686772
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+ override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
+ mutable.Map[TimeZone, (Calendar, Long)]()
+ }
+ }
+
+ def getCalendar(timeZone: TimeZone): Calendar = {
+ val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+ .getOrElseUpdate(timeZone, {
+ val c = Calendar.getInstance(timeZone)
+ (c, c.getTimeInMillis)
+ })
+ c.clear()
+ c.setTimeInMillis(timeInMillis)
--- End diff --
@kiszk @viirya I've updated the code. Please review.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193678440
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+ override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
+ mutable.Map[TimeZone, (Calendar, Long)]()
+ }
+ }
+
+ def getCalendar(timeZone: TimeZone): Calendar = {
+ val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+ .getOrElseUpdate(timeZone, {
+ val c = Calendar.getInstance(timeZone)
+ (c, c.getTimeInMillis)
+ })
+ c.clear()
+ c.setTimeInMillis(timeInMillis)
--- End diff --
I agree with @viirya 's comment. Do we need to set the value of `System.currentTimeMillis()`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21505
**[Test build #92969 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92969/testReport)** for PR 21505 at commit [`c940381`](https://github.com/apache/spark/commit/c940381a0be36fd227e8f63caf32d3be86c5aa69).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/21505
gentle ping @ssonker
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r206699663
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,22 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
+ override def initialValue(): mutable.Map[TimeZone, Calendar] = {
+ mutable.Map[TimeZone, Calendar]()
+ }
+ }
+
+ def getCalendar(timeZone: TimeZone): Calendar = {
--- End diff --
Can this be private?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193687670
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -114,20 +114,19 @@ object DateTimeUtils {
}
private val threadLocalComputedCalendarsMap =
- new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
- override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
- mutable.Map[TimeZone, (Calendar, Long)]()
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
+ override def initialValue(): mutable.Map[TimeZone, Calendar] = {
+ mutable.Map[TimeZone, Calendar]()
}
}
def getCalendar(timeZone: TimeZone): Calendar = {
- val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+ val c = threadLocalComputedCalendarsMap.get()
.getOrElseUpdate(timeZone, {
- val c = Calendar.getInstance(timeZone)
- (c, c.getTimeInMillis)
+ Calendar.getInstance(timeZone)
})
c.clear()
- c.setTimeInMillis(timeInMillis)
+ c.setTimeInMillis(System.currentTimeMillis())
--- End diff --
oh, Calendar.getTimeInMillis and setTimeInMillis are also UTC-based.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21505
The PR title is too long and truncated. Can you shorten it?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193696451
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --
Do we need to keep Calendar for many timezone? Since `getCalendar` takes a time zone input, we can just keep one Calendar instance, and set it with given timezone in `getCalendar`. WDYT? Regarding performance, is there big difference?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193675439
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+ override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
+ mutable.Map[TimeZone, (Calendar, Long)]()
+ }
+ }
+
+ def getCalendar(timeZone: TimeZone): Calendar = {
+ val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+ .getOrElseUpdate(timeZone, {
+ val c = Calendar.getInstance(timeZone)
+ (c, c.getTimeInMillis)
--- End diff --
Please refer line 130 for this. Before returning the calendar instance, it is reinitialized to the time it was originally created.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r194305723
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --
@kiszk @viirya I've tried running benchmarks for with/without ```mutable.Map``` implementation. Looks like setting timezone in a calendar instance is a costly operation and it drags the performance down. As the number of timezones cannot be large, maintaining a map will not be a huge memory overhead. So, I suggest going with the ```mutable.Map``` approach. Comments?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r194325153
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --
I'm fine with the map approach.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r193687346
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -114,20 +114,19 @@ object DateTimeUtils {
}
private val threadLocalComputedCalendarsMap =
- new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
- override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
- mutable.Map[TimeZone, (Calendar, Long)]()
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
+ override def initialValue(): mutable.Map[TimeZone, Calendar] = {
+ mutable.Map[TimeZone, Calendar]()
}
}
def getCalendar(timeZone: TimeZone): Calendar = {
- val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+ val c = threadLocalComputedCalendarsMap.get()
.getOrElseUpdate(timeZone, {
- val c = Calendar.getInstance(timeZone)
- (c, c.getTimeInMillis)
+ Calendar.getInstance(timeZone)
})
c.clear()
- c.setTimeInMillis(timeInMillis)
+ c.setTimeInMillis(System.currentTimeMillis())
--- End diff --
hmm, I think `System.currentTimeMillis()` is UTC-based?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on the issue:
https://github.com/apache/spark/pull/21505
@kiszk I've added a benchmark according to your recommendation, please review.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r194268734
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --
Usually, only the default time zone is used. To execute `Cast` regarding date is called with a timezone may use another timezone. For the correctness, I think that it is necessary to support multiple timezones.
To enable caching for default time zone and to create an instance for other time zones would also work correctly.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21505
ok to test
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r206741169
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/StringToTimestampBenchmark.scala ---
@@ -0,0 +1,53 @@
+/*
--- End diff --
Yea, that's what I was thinking.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/21505#discussion_r194848108
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
@@ -111,6 +113,22 @@ object DateTimeUtils {
computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
}
+ private val threadLocalComputedCalendarsMap =
+ new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
+ override def initialValue(): mutable.Map[TimeZone, Calendar] = {
+ mutable.Map[TimeZone, Calendar]()
+ }
+ }
+
+ def getCalendar(timeZone: TimeZone): Calendar = {
+ val c = threadLocalComputedCalendarsMap.get()
+ .getOrElseUpdate(timeZone, {
--- End diff --
nit: `getOrElseUpdate(timeZone, Calendar.getInstance(timeZone))`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org