Posted to issues@spark.apache.org by "Shubham Chaurasia (Jira)" <ji...@apache.org> on 2021/03/09 14:06:00 UTC

[jira] [Updated] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

     [ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chaurasia updated SPARK-34675:
--------------------------------------
    Description: 
Inserted the following data with UTC as both the JVM and session timezone.

Spark-shell launch command:
{code}
bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions=' -Duser.timezone=UTC' --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC'
{code}
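
When {{spark.sql.session.timeZone}} is not set explicitly, it should default to the JVM default timezone, so the {{-Duser.timezone=UTC}} flags above make both UTC. A quick check from the shell:
{code:scala}
// Both should print UTC with the launch command above; the session timezone
// falls back to the JVM default when it is not configured explicitly.
java.util.TimeZone.getDefault.getID
spark.conf.get("spark.sql.session.timeZone")
{code}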

Table creation and inserts:
{code:scala}
sql("use ts").show
sql("create table spark_parquet(type string, t timestamp) stored as parquet").show
sql("create table spark_orc(type string, t timestamp) stored as orc").show
sql("create table spark_avro(type string, t timestamp) stored as avro").show
sql("create table spark_text(type string, t timestamp) stored as textfile").show
sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 01:02:03')").show
sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 01:02:03')").show
sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 01:02:03')").show
sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 01:02:03')").show
{code}
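
For reference, the string literal in the inserts should be interpreted in the session timezone when cast to timestamp, so with everything set to UTC the stored instant should be {{1989-01-05 01:02:03}} UTC, i.e. {{599965323000}} ms since the epoch. A minimal sketch to check this:
{code:scala}
// Sketch: the literal is cast using the session timezone (UTC here), so the
// resulting instant should be 599965323000 ms since the epoch.
sql("select unix_timestamp(cast('1989-01-05 01:02:03' as timestamp)) * 1000 as millis").show
// expected: 599965323000
{code}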

Used the following function to check and verify the returned timestamps:
{code:scala}
scala> :paste
// Entering paste mode (ctrl-D to finish)

// Runs "select * from <table>" for every table in the given database, unions
// the results, and shows each timestamp both as rendered by Spark and as
// epoch milliseconds (java.sql.Timestamp.getTime).
def showTs(
    db: String,
    tables: String*
): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
  sql("use " + db).show
  import scala.collection.mutable.ListBuffer
  val results = new ListBuffer[org.apache.spark.sql.DataFrame]()
  for (tbl <- tables) {
    val query = "select * from " + tbl
    println("Executing - " + query)
    results += sql(query)
  }
  // The three timezone settings that matter for this report.
  println("user.timezone - " + System.getProperty("user.timezone"))
  println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
  println("spark.sql.session.timeZone - " + spark.conf.get("spark.sql.session.timeZone"))
  var unionDf = results(0)
  for (i <- 1 until results.length) {
    unionDf = unionDf.unionAll(results(i))
  }
  val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), r.getTimestamp(1).getTime))
  val renamed = augmented.withColumnRenamed("_1", "type").withColumnRenamed("_2", "ts").withColumnRenamed("_3", "millis")
  renamed.show(false)
  renamed
}

// Exiting paste mode, now interpreting.

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368
++
||
++
++

Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - UTC
TimeZone.getDefault - UTC
spark.sql.session.timeZone - UTC
+----------------------+-------------------+------------+                       
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
+----------------------+-------------------+------------+
{code}

1. Set the session timezone to America/Los_Angeles
{code:scala}

scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
++
||
++
++

Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - UTC
TimeZone.getDefault - UTC
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
+----------------------+-------------------+------------+
{code}

2. Started the shell (JVM) in the America/Los_Angeles timezone (which also sets the session timezone to America/Los_Angeles)
{code}
bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions=' -Duser.timezone=America/Los_Angeles' --conf spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
{code}

{code:scala}
scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
Hive Session ID = 10ff355c-318d-4cb8-870f-a388652133b1
++
||
++
++

Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+                       
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599994123000|
+----------------------+-------------------+------------+
{code}


As cases 1 and 2 show, Parquet and the other formats behave differently.

In 1, {{1989-01-04 17:02:03|599965323000}} looks correct for the {{America/Los_Angeles}} timezone, since the value originally inserted in UTC was {{1989-01-05 01:02:03}}, which equals {{599965323000}} ms since the epoch.
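
The arithmetic can be double-checked outside Spark with plain {{java.time}} (a sketch, not Spark internals):
{code:scala}
import java.time.{LocalDateTime, ZoneOffset}

// 1989-01-05 01:02:03 as UTC wall-clock time:
// 6944 days * 86400 s + 3723 s = 599965323 s = 599965323000 ms since the epoch.
val utcMillis = LocalDateTime.parse("1989-01-05T01:02:03")
  .toInstant(ZoneOffset.UTC)
  .toEpochMilli
// utcMillis == 599965323000L
{code}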

In 2, only Parquet looks correct, while the other formats produce {{1989-01-05 01:02:03|599994123000}}, which should not happen when a different timezone ({{America/Los_Angeles}}) is in effect. I think these values come from the individual file-format readers (Avro, ORC, text), but they do not convert the result to the {{America/Los_Angeles}} timezone. I saw the ORC reader adjusting the offset according to the writer and reader (JVM) timezones (Avro probably does the same), yet the value Spark ultimately returns does not follow {{spark.sql.session.timeZone}}.
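
Notably, the wrong millis value in 2 is exactly the original wall-clock string re-anchored to the reader's JVM timezone; the difference of {{28800000}} ms is the 8-hour PST offset. A sketch of that arithmetic with plain {{java.time}} (again, not Spark internals):
{code:scala}
import java.time.{LocalDateTime, ZoneId}

val wallClock = LocalDateTime.parse("1989-01-05T01:02:03")

val asUtc = wallClock.atZone(ZoneId.of("UTC")).toInstant.toEpochMilli
val asLa  = wallClock.atZone(ZoneId.of("America/Los_Angeles")).toInstant.toEpochMilli

// asUtc == 599965323000L  -> the instant that was inserted
// asLa  == 599994123000L  -> what ORC/Avro/text report in case 2
// asLa - asUtc == 28800000L, i.e. exactly 8 hours (PST is UTC-8 in January)
{code}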

Are there any guidelines/docs on how to use timezones with Spark?

cc [~cloud_fan] [~hyukjin.kwon] [~dongjoon]



> TimeZone inconsistencies when JVM and session timezones are different
> ---------------------------------------------------------------------
>
>                 Key: SPARK-34675
>                 URL: https://issues.apache.org/jira/browse/SPARK-34675
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.7
>            Reporter: Shubham Chaurasia
>            Priority: Major
>