You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ssonker <gi...@git.apache.org> on 2018/06/07 08:56:07 UTC

[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

GitHub user ssonker opened a pull request:

    https://github.com/apache/spark/pull/21505

    [SPARK-24457][SQL] Improving performance of stringToTimestamp by cach…

    …ing Calendar instances for input timezones instead of creating new everytime
    
    ## What changes were proposed in this pull request?
    
    As of now, stringToTimestamp function in DateTimeUtils creates a calendar instance on each call. This change maintains a thread-local timezone to calendar map, and creates just one calendar for each timezone. Whenever a calendar instance is queried given a timezone, it is looked-up inside the map, reinitialized and returned.
    
    ## How was this patch tested?
    
    Using existing test cases.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ssonker/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21505.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21505
    
----
commit 84d5a911408411f327b620bb958b996e55264781
Author: Sharad Sonker <ss...@...>
Date:   2018-06-07T04:56:37Z

    [SPARK-24457][SQL] Improving performance of stringToTimestamp by caching Calendar instances for input timezones instead of creating new everytime

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    I need to update this PR with more changes, will open a new one! Closing this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193679978
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,24 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
    +      override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
    +        mutable.Map[TimeZone, (Calendar, Long)]()
    +      }
    +    }
    +
    +  def getCalendar(timeZone: TimeZone): Calendar = {
    +    val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
    +      .getOrElseUpdate(timeZone, {
    +        val c = Calendar.getInstance(timeZone)
    +        (c, c.getTimeInMillis)
    +      })
    +    c.clear()
    +    c.setTimeInMillis(timeInMillis)
    --- End diff --
    
    @kiszk Thanks, I'm updating that. BTW, can you please help me understand a scenario where that is absolutely needed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r194365155
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,23 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    --- End diff --
    
    +1 for map approach


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193688565
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,23 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    +      override def initialValue(): mutable.Map[TimeZone, Calendar] = {
    +        mutable.Map[TimeZone, Calendar]()
    +      }
    +    }
    +
    +  def getCalendar(timeZone: TimeZone): Calendar = {
    +    val c = threadLocalComputedCalendarsMap.get()
    +      .getOrElseUpdate(timeZone, {
    +        Calendar.getInstance(timeZone)
    +      })
    +    c.clear()
    --- End diff --
    
    Doesn't `clear` reset the timezone of that `Calendar` instance?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r194294961
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -125,7 +125,6 @@ object DateTimeUtils {
           .getOrElseUpdate(timeZone, {
             Calendar.getInstance(timeZone)
           })
    -    c.clear()
    --- End diff --
    
    Seems `setTimeInMillis` can result in all fields set. If so, `clear` is redundant.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker closed the pull request at:

    https://github.com/apache/spark/pull/21505


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r194299485
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,23 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    --- End diff --
    
    Yes, it should work functionally if we check a given time zone every time. 
    Do you know the typical access pattern of time zone? If there is temporal locality regarding time zone, we do not have to use `mutale.Map`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    ping @ssonker to update or close.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r194294182
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -125,7 +125,6 @@ object DateTimeUtils {
           .getOrElseUpdate(timeZone, {
             Calendar.getInstance(timeZone)
           })
    -    c.clear()
    --- End diff --
    
    @viirya @kiszk Do you agree with this commit?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    @kiszk @viirya Do you have more review comments that need to be incorporated? If not, can you please get this merged?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r194292883
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,23 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    --- End diff --
    
    @kiszk I think @viirya meant having just one thread-local calendar instance. That should also work, isn't it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r202548671
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/StringToTimestampBenchmark.scala ---
    @@ -0,0 +1,53 @@
    +/*
    --- End diff --
    
    Does this really need to go in the code base?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193676953
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,24 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
    +      override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
    +        mutable.Map[TimeZone, (Calendar, Long)]()
    +      }
    +    }
    +
    +  def getCalendar(timeZone: TimeZone): Calendar = {
    +    val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
    +      .getOrElseUpdate(timeZone, {
    +        val c = Calendar.getInstance(timeZone)
    +        (c, c.getTimeInMillis)
    --- End diff --
    
    Yes, correct.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    @viirya Done.
    cc: @cloud-fan 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r194847824
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/StringToTimestampBenchmark.scala ---
    @@ -0,0 +1,53 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql
    +
    +import java.util.Calendar
    +
    +import org.apache.spark.sql.catalyst.util.{DateTimeTestUtils, DateTimeUtils}
    +import org.apache.spark.util.Benchmark
    +
    +object StringToTimestampBenchmark {
    --- End diff --
    
    This benchmark tests `Calendar` creation, not string to timestamp. how abuot `DateTimeUtilsBenchmark`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    cc @cloud-fan  


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    We would appreciate it if you put the performance before and after this PR?
    It would be good to use `Benchmark` class.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193694372
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,23 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    +      override def initialValue(): mutable.Map[TimeZone, Calendar] = {
    +        mutable.Map[TimeZone, Calendar]()
    +      }
    +    }
    +
    +  def getCalendar(timeZone: TimeZone): Calendar = {
    +    val c = threadLocalComputedCalendarsMap.get()
    +      .getOrElseUpdate(timeZone, {
    +        Calendar.getInstance(timeZone)
    +      })
    +    c.clear()
    --- End diff --
    
    Nope. It clears all the ```fields``` and zone is not a field.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92969/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r194309823
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,23 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    --- End diff --
    
    How much performance down without a map here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    **[Test build #92969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92969/testReport)** for PR 21505 at commit [`c940381`](https://github.com/apache/spark/commit/c940381a0be36fd227e8f63caf32d3be86c5aa69).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193676413
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,24 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
    +      override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
    +        mutable.Map[TimeZone, (Calendar, Long)]()
    +      }
    +    }
    +
    +  def getCalendar(timeZone: TimeZone): Calendar = {
    +    val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
    +      .getOrElseUpdate(timeZone, {
    +        val c = Calendar.getInstance(timeZone)
    +        (c, c.getTimeInMillis)
    --- End diff --
    
    Isn't `timeInMillis` also stored when you first update this map entity? So next time you access this calendar, you just set it with the old `timeInMillis`. Isn't it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193674578
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,24 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
    +      override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
    +        mutable.Map[TimeZone, (Calendar, Long)]()
    +      }
    +    }
    +
    +  def getCalendar(timeZone: TimeZone): Calendar = {
    +    val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
    +      .getOrElseUpdate(timeZone, {
    +        val c = Calendar.getInstance(timeZone)
    +        (c, c.getTimeInMillis)
    --- End diff --
    
    When you get the calendar instance next time, isn't its time out of date?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r194319491
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,23 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    --- End diff --
    
    I ran a benchmark and following is the output:
    ```string to timestamp calendar caching:    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    with map                                         8 /    9         12.9          77.7       1.0X
    without map                                     11 /   12          9.4         106.3       0.7X```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193675674
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,24 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
    +      override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
    +        mutable.Map[TimeZone, (Calendar, Long)]()
    +      }
    +    }
    +
    +  def getCalendar(timeZone: TimeZone): Calendar = {
    +    val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
    +      .getOrElseUpdate(timeZone, {
    +        val c = Calendar.getInstance(timeZone)
    +        (c, c.getTimeInMillis)
    --- End diff --
    
    @viirya ^ Does that answer you question, or you mean something else?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193686772
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,24 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
    +      override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
    +        mutable.Map[TimeZone, (Calendar, Long)]()
    +      }
    +    }
    +
    +  def getCalendar(timeZone: TimeZone): Calendar = {
    +    val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
    +      .getOrElseUpdate(timeZone, {
    +        val c = Calendar.getInstance(timeZone)
    +        (c, c.getTimeInMillis)
    +      })
    +    c.clear()
    +    c.setTimeInMillis(timeInMillis)
    --- End diff --
    
    @kiszk @viirya I've updated the code. Please review.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193678440
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,24 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
    +      override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
    +        mutable.Map[TimeZone, (Calendar, Long)]()
    +      }
    +    }
    +
    +  def getCalendar(timeZone: TimeZone): Calendar = {
    +    val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
    +      .getOrElseUpdate(timeZone, {
    +        val c = Calendar.getInstance(timeZone)
    +        (c, c.getTimeInMillis)
    +      })
    +    c.clear()
    +    c.setTimeInMillis(timeInMillis)
    --- End diff --
    
    I agree with @viirya 's comment. Do we need to set the value of `System.currentTimeMillis()`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    **[Test build #92969 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92969/testReport)** for PR 21505 at commit [`c940381`](https://github.com/apache/spark/commit/c940381a0be36fd227e8f63caf32d3be86c5aa69).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    gentle ping @ssonker 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r206699663
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,22 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    +      override def initialValue(): mutable.Map[TimeZone, Calendar] = {
    +        mutable.Map[TimeZone, Calendar]()
    +      }
    +    }
    +
    +  def getCalendar(timeZone: TimeZone): Calendar = {
    --- End diff --
    
    Can this be private?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193687670
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -114,20 +114,19 @@ object DateTimeUtils {
       }
     
       private val threadLocalComputedCalendarsMap =
    -    new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
    -      override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
    -        mutable.Map[TimeZone, (Calendar, Long)]()
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    +      override def initialValue(): mutable.Map[TimeZone, Calendar] = {
    +        mutable.Map[TimeZone, Calendar]()
           }
         }
     
       def getCalendar(timeZone: TimeZone): Calendar = {
    -    val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
    +    val c = threadLocalComputedCalendarsMap.get()
           .getOrElseUpdate(timeZone, {
    -        val c = Calendar.getInstance(timeZone)
    -        (c, c.getTimeInMillis)
    +        Calendar.getInstance(timeZone)
           })
         c.clear()
    -    c.setTimeInMillis(timeInMillis)
    +    c.setTimeInMillis(System.currentTimeMillis())
    --- End diff --
    
    oh, Calendar.getTimeInMillis and setTimeInMillis are also UTC-based.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    The PR title is too long and truncated. Can you shorten it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193696451
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,23 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    --- End diff --
    
    Do we need to keep Calendar for many timezone? Since `getCalendar` takes a time zone input, we can just keep one Calendar instance, and set it with given timezone in `getCalendar`. WDYT? Regarding performance, is there big difference?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193675439
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,24 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
    +      override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
    +        mutable.Map[TimeZone, (Calendar, Long)]()
    +      }
    +    }
    +
    +  def getCalendar(timeZone: TimeZone): Calendar = {
    +    val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
    +      .getOrElseUpdate(timeZone, {
    +        val c = Calendar.getInstance(timeZone)
    +        (c, c.getTimeInMillis)
    --- End diff --
    
    Please refer line 130 for this. Before returning the calendar instance, it is reinitialized to the time it was originally created.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r194305723
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,23 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    --- End diff --
    
    @kiszk @viirya I've tried running benchmarks for with/without ```mutable.Map``` implementation. Looks like setting timezone in a calendar instance is a costly operation and it drags the performance down. As the number of timezones cannot be large, maintaining a map will not be a huge memory overhead. So, I suggest going with the ```mutable.Map``` approach. Comments?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r194325153
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,23 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    --- End diff --
    
    I'm fine with the map approach.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r193687346
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -114,20 +114,19 @@ object DateTimeUtils {
       }
     
       private val threadLocalComputedCalendarsMap =
    -    new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
    -      override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = {
    -        mutable.Map[TimeZone, (Calendar, Long)]()
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    +      override def initialValue(): mutable.Map[TimeZone, Calendar] = {
    +        mutable.Map[TimeZone, Calendar]()
           }
         }
     
       def getCalendar(timeZone: TimeZone): Calendar = {
    -    val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
    +    val c = threadLocalComputedCalendarsMap.get()
           .getOrElseUpdate(timeZone, {
    -        val c = Calendar.getInstance(timeZone)
    -        (c, c.getTimeInMillis)
    +        Calendar.getInstance(timeZone)
           })
         c.clear()
    -    c.setTimeInMillis(timeInMillis)
    +    c.setTimeInMillis(System.currentTimeMillis())
    --- End diff --
    
    hmm, I think `System.currentTimeMillis()`  is UTC-based?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by ssonker <gi...@git.apache.org>.
Github user ssonker commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    @kiszk I've added a benchmark according to your recommendation, please review.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r194268734
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,23 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    --- End diff --
    
    Usually, only the default time zone is used. To execute `Cast` regarding date is called with a timezone may use another timezone. For the correctness, I think that it is necessary to support multiple timezones.
    
    To enable caching for default time zone and to create an instance for other time zones would also work correctly.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21505: [SPARK-24457][SQL] Improving performance of stringToTime...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21505
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r206741169
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/StringToTimestampBenchmark.scala ---
    @@ -0,0 +1,53 @@
    +/*
    --- End diff --
    
    Yea, that's what I was thinking.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21505#discussion_r194848108
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -111,6 +113,22 @@ object DateTimeUtils {
         computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
       }
     
    +  private val threadLocalComputedCalendarsMap =
    +    new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
    +      override def initialValue(): mutable.Map[TimeZone, Calendar] = {
    +        mutable.Map[TimeZone, Calendar]()
    +      }
    +    }
    +
    +  def getCalendar(timeZone: TimeZone): Calendar = {
    +    val c = threadLocalComputedCalendarsMap.get()
    +      .getOrElseUpdate(timeZone, {
    --- End diff --
    
    nit: `getOrElseUpdate(timeZone, Calendar.getInstance(timeZone))`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org