Posted to issues@spark.apache.org by "gagan taneja (JIRA)" <ji...@apache.org> on 2017/03/05 03:19:32 UTC

[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly

    [ https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896040#comment-15896040 ] 

gagan taneja commented on SPARK-19145:
--------------------------------------

The code responsible for this belongs to the class org.apache.spark.sql.catalyst.analysis.TypeCoercion, and the snippet below performs the type coercion. Although this is logically correct, it also slows down the binary comparison: during execution the Timestamp/Date values are cast to String and the comparison is done with string operators, resulting in roughly 10x slower performance.

org.apache.spark.sql.catalyst.analysis.TypeCoercion {
.....
....
      // We should cast all relative timestamp/date/string comparison into string comparisons
      // This behaves as a user would expect because timestamp strings sort lexicographically.
      // i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true
      case p @ BinaryComparison(left @ StringType(), right @ DateType()) =>
        p.makeCopy(Array(left, Cast(right, StringType)))
      case p @ BinaryComparison(left @ DateType(), right @ StringType()) =>
        p.makeCopy(Array(Cast(left, StringType), right))
      case p @ BinaryComparison(left @ StringType(), right @ TimestampType()) =>
        p.makeCopy(Array(left, Cast(right, StringType)))
      case p @ BinaryComparison(left @ TimestampType(), right @ StringType()) =>
        p.makeCopy(Array(Cast(left, StringType), right))

      // Comparisons between dates and timestamps.
      case p @ BinaryComparison(left @ TimestampType(), right @ DateType()) =>
        p.makeCopy(Array(Cast(left, StringType), Cast(right, StringType)))
      case p @ BinaryComparison(left @ DateType(), right @ TimestampType()) =>
        p.makeCopy(Array(Cast(left, StringType), Cast(right, StringType)))
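
A possible direction, sketched here only as an illustration and not taken from the existing Spark code, would be to cast a foldable string operand to Timestamp instead of casting the timestamp column to String, so the comparison stays on the native long representation and the predicate remains eligible for Parquet push-down:

      // Hypothetical rule, not the current TypeCoercion behaviour: when a
      // timestamp column is compared with a string literal, cast the literal
      // side to TimestampType instead of casting the column to StringType.
      case p @ BinaryComparison(left @ TimestampType(), right @ StringType())
          if right.foldable =>
        p.makeCopy(Array(left, Cast(right, TimestampType)))
      case p @ BinaryComparison(left @ StringType(), right @ TimestampType())
          if left.foldable =>
        p.makeCopy(Array(Cast(left, TimestampType), right))

A literal that does not parse as a timestamp would come back as null from the cast, so such a rule would need to keep the existing string-comparison semantics for non-literal or unparsable strings.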


> Timestamp to String casting is slowing the query significantly
> --------------------------------------------------------------
>
>                 Key: SPARK-19145
>                 URL: https://issues.apache.org/jira/browse/SPARK-19145
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: gagan taneja
>
> I have a time series table with a timestamp column.
> The following query
> SELECT COUNT(*) AS `count`
>                FROM `default`.`table`
>                WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 50000
> is significantly SLOWER than 
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD HH24:MI:SS−0800') LIMIT 50000
> After investigation I found that in the first query the time column is cast to String before applying the filter.
> However, in the second query no such casting is performed and the filter uses the long value directly.
> Below is the generated physical plan for the slower execution, followed by the physical plan for the faster execution:
> SELECT COUNT(*) AS `count`
>                FROM `default`.`table`
>                WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 50000
> == Physical Plan ==
> CollectLimit 50000
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
>                +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct<time:timestamp>
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD HH24:MI:SS−0800') LIMIT 50000
> == Physical Plan ==
> CollectLimit 50000
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3262) && (time#3262 >= 1483404831000000)) && (time#3262 <= 1484009631000000))
>                +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time), GreaterThanOrEqual(time,2017-01-02 19:53:51.0), LessThanOrEqual(time,2017-01-09..., ReadSchema: struct<time:timestamp>
> In Impala both queries run efficiently without any performance difference.
> Spark should be able to parse the date string and convert it to Long/Timestamp during generation of the optimized logical plan so that both queries would have similar performance.
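
For reference, a user-side workaround that avoids the string comparison (a minimal sketch using spark-shell and the table/column names from the queries above) is to make the literal an explicit TIMESTAMP, so both sides of the comparison are already timestamps and TypeCoercion leaves the filter on the long value:

// Assumes a SparkSession named `spark` and the `default`.`table` with a
// timestamp column `time` from the report above.
val fast = spark.sql("""
  SELECT COUNT(*) AS `count`
  FROM `default`.`table`
  WHERE `time` >= CAST('2017-01-02 19:53:51' AS TIMESTAMP)
    AND `time` <= CAST('2017-01-09 19:53:51' AS TIMESTAMP)
  LIMIT 50000
""")
fast.explain()  // the Filter should compare `time` against long values, as in the faster plan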



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org