Posted to commits@spark.apache.org by rx...@apache.org on 2016/02/15 01:00:22 UTC

spark git commit: [SPARK-13185][SQL] Reuse Calendar object in DateTimeUtils.StringToDate method to improve performance

Repository: spark
Updated Branches:
  refs/heads/master 22e9723d6 -> 7cb4d74c9


[SPARK-13185][SQL] Reuse Calendar object in DateTimeUtils.StringToDate method to improve performance

The Java `Calendar` object is expensive to create. I have a subquery like this: `SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01')>=0 AND datediff(UV.visitDate, '2015-01-01')<=0)`

The table stores `visitDate` as a String and has 3 billion records. A `Calendar` object is created every time `DateTimeUtils.stringToDate` is called. By reusing the `Calendar` object, I saw an improvement of about 20 seconds for this stage.
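
The reuse pattern described above can be sketched in isolation. This is a minimal illustration, not the code in `DateTimeUtils`: the names `CalendarReuseSketch`, `GmtZone`, `MillisPerDay`, and `daysSinceEpoch` are hypothetical stand-ins, and the input is assumed to be already parsed and validated into year, month, and day.

```scala
import java.util.{Calendar, TimeZone}

object CalendarReuseSketch {
  // Hypothetical stand-ins for the constants used in DateTimeUtils.
  private val GmtZone: TimeZone = TimeZone.getTimeZone("GMT")
  private val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // One Calendar per thread: Calendar is mutable and not thread-safe,
  // so a single shared instance would be unsafe across threads.
  private val gmtCalendar = new ThreadLocal[Calendar] {
    override protected def initialValue(): Calendar = Calendar.getInstance(GmtZone)
  }

  /** Days since 1970-01-01 (GMT) for an already validated year, month, and day. */
  def daysSinceEpoch(year: Int, month: Int, day: Int): Int = {
    val c = gmtCalendar.get()
    // Reset every field so state left over from the previous call on this
    // thread (including milliseconds) cannot leak into this result.
    c.clear()
    c.set(year, month - 1, day, 0, 0, 0) // Calendar months are zero-based
    (c.getTimeInMillis / MillisPerDay).toInt
  }

  def main(args: Array[String]): Unit = {
    println(daysSinceEpoch(1970, 1, 2))
    println(daysSinceEpoch(1997, 1, 1))
  }
}
```

The `clear()` call is what makes the reuse safe: without it, fields set by an earlier call on the same thread would carry over into the next conversion.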

Author: Carson Wang <ca...@intel.com>

Closes #11090 from carsonwang/SPARK-13185.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7cb4d74c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7cb4d74c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7cb4d74c

Branch: refs/heads/master
Commit: 7cb4d74c98c2f1765b48a549f62e47b53ed29b38
Parents: 22e9723
Author: Carson Wang <ca...@intel.com>
Authored: Sun Feb 14 16:00:20 2016 -0800
Committer: Reynold Xin <rx...@databricks.com>
Committed: Sun Feb 14 16:00:20 2016 -0800

----------------------------------------------------------------------
 .../apache/spark/sql/catalyst/util/DateTimeUtils.scala    | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/7cb4d74c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
index a159bc6..f184d72 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
@@ -59,6 +59,13 @@ object DateTimeUtils {
 
   @transient lazy val defaultTimeZone = TimeZone.getDefault
 
+  // Reuse the Calendar object in each thread as it is expensive to create in each method call.
+  private val threadLocalGmtCalendar = new ThreadLocal[Calendar] {
+    override protected def initialValue: Calendar = {
+      Calendar.getInstance(TimeZoneGMT)
+    }
+  }
+
   // Java TimeZone has no mention of thread safety. Use thread local instance to be safe.
   private val threadLocalLocalTimeZone = new ThreadLocal[TimeZone] {
     override protected def initialValue: TimeZone = {
@@ -408,7 +415,8 @@ object DateTimeUtils {
         segments(2) < 1 || segments(2) > 31) {
       return None
     }
-    val c = Calendar.getInstance(TimeZoneGMT)
+    val c = threadLocalGmtCalendar.get()
+    c.clear()
     c.set(segments(0), segments(1) - 1, segments(2), 0, 0, 0)
     c.set(Calendar.MILLISECOND, 0)
     Some((c.getTimeInMillis / MILLIS_PER_DAY).toInt)

