Posted to reviews@spark.apache.org by squito <gi...@git.apache.org> on 2017/02/02 17:04:34 UTC

[GitHub] spark pull request #16781: [SPARK-12297][SQL][POC] Hive compatibility for Pa...

GitHub user squito opened a pull request:

    https://github.com/apache/spark/pull/16781

    [SPARK-12297][SQL][POC] Hive compatibility for Parquet Timestamps

    ## What changes were proposed in this pull request?
    
    Hive has very strange behavior when writing timestamps to Parquet data: it always applies a conversion from the local timezone to UTC on write, which it then reverses when reading the data back.  For compatibility with Hive, Spark should provide an option to perform the same conversion, when necessary, based on table metadata.  This goes along with HIVE-12767.
    
    Note that the default for Spark remains unchanged: created tables are marked as UTC, which means the read and write paths remain unchanged (and avoid slow timezone logic).  The major use case is that *legacy* tables written by Hive can now be read correctly by Spark, as long as the appropriate table properties are set.
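    
    To make the conversion concrete, here is a minimal sketch (illustration only, not the patch's actual code) of shifting an epoch-millis value between two timezones so the wall-clock reading is preserved; `convertTz` is a hypothetical helper:
    
    ```scala
    import java.util.TimeZone

    // Shift `millis` so the wall-clock it displays in `toTz` matches what it displayed
    // in `fromTz`. getOffset returns each zone's UTC offset at that instant (incl. DST).
    def convertTz(millis: Long, fromTz: TimeZone, toTz: TimeZone): Long =
      millis + fromTz.getOffset(millis) - toTz.getOffset(millis)

    val la = TimeZone.getTimeZone("America/Los_Angeles")
    val utc = TimeZone.getTimeZone("UTC")
    // Hive-style write: shift as if converting LA wall-clock to UTC; reading back in an
    // LA session reverses the shift, so the data round-trips unchanged.
    val millis = 1451606399123L            // 2015-12-31 23:59:59.123 UTC
    val stored = convertTz(millis, la, utc)
    assert(convertTz(stored, utc, la) == millis)
    ```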
    
    ## How was this patch tested?
    
    Added a unit test which creates tables, then writes and reads data, under a variety of permutations (a conf controlling whether or not a default tz is added to new tables, different explicit timezones, and vectorized reading on and off).
    
    TODO
    * [ ] predicate pushdown

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark SPARK-12297

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16781.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16781
    
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112362704
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +160,326 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    Seq(
    +      "UTC" -> "UTC",
    +      "LA" -> "America/Los_Angeles",
    +      "Berlin" -> "Europe/Berlin"
    +    ).foreach { case (tableName, zone) =>
    +      if (zone != localTz) {
    +        testCreateWriteRead(sparkSession, tableName, Some(zone), sessionTzOpt)
    +      }
    +    }
    +  }
    +
    +  private def testCreateWriteRead(
    +      sparkSession: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    testCreateAlterTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testWriteTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testReadTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +  }
    +
    +  private def checkHasTz(table: String, tz: Option[String]): Unit = {
    +    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    +    assert(tableMetadata.properties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) === tz)
    +  }
    +
    +  private def testCreateAlterTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    test(s"SPARK-12297: Create and Alter Parquet tables and timezones; explicitTz = $explicitTz; " +
    +      s"sessionTzOpt = $sessionTzOpt") {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +      withTable(baseTable, s"like_$baseTable", s"select_$baseTable") {
    +        val localTz = TimeZone.getDefault()
    +        val localTzId = localTz.getID()
    +        // If we ever add a property to set the table timezone by default, defaultTz would change
    +        val defaultTz = None
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    +        }.getOrElse("")
    +        spark.sql(
    +          raw"""CREATE TABLE $baseTable (
    +                |  x int
    +                | )
    +                | STORED AS PARQUET
    +                | $tblProperties
    +            """.stripMargin)
    +        val expectedTableTz = explicitTz.orElse(defaultTz)
    +        checkHasTz(baseTable, expectedTableTz)
    +        spark.sql(s"CREATE TABLE like_$baseTable LIKE $baseTable")
    +        checkHasTz(s"like_$baseTable", expectedTableTz)
    +        spark.sql(
    +          raw"""CREATE TABLE select_$baseTable
    +                | STORED AS PARQUET
    +                | AS
    +                | SELECT * from $baseTable
    +            """.stripMargin)
    +        checkHasTz(s"select_$baseTable", defaultTz)
    +
    +        // check alter table, setting, unsetting, resetting the property
    +        spark.sql(
    +          raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="America/Los_Angeles")""")
    +        checkHasTz(baseTable, Some("America/Los_Angeles"))
    +        spark.sql( raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="UTC")""")
    +        checkHasTz(baseTable, Some("UTC"))
    +        spark.sql( raw"""ALTER TABLE $baseTable UNSET TBLPROPERTIES ($key)""")
    +        checkHasTz(baseTable, None)
    +        explicitTz.foreach { tz =>
    +          spark.sql( raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="$tz")""")
    +          checkHasTz(baseTable, expectedTableTz)
    +        }
    +      }
    +    }
    +  }
    +
    +  val desiredTimestampStrings = Seq(
    +    "2015-12-31 22:49:59.123",
    +    "2015-12-31 23:50:59.123",
    +    "2016-01-01 00:39:59.123",
    +    "2016-01-01 01:29:59.123"
    +  )
    +  // We don't want to mess with timezones inside the tests themselves, since we use a shared
    +  // spark context, and then we might be prone to issues from lazy vals for timezones.  Instead,
    +  // we manually adjust the timezone just to determine what the desired millis (since epoch, in utc)
    +  // is for various "wall-clock" times in different timezones, and then we can compare against those
    +  // in our tests.
    +  val originalTz = TimeZone.getDefault
    +  val timestampTimezoneToMillis = try {
    +    (for {
    +      timestampString <- desiredTimestampStrings
    +      timezone <- Seq("America/Los_Angeles", "Europe/Berlin", "UTC").map {
    +        TimeZone.getTimeZone(_)
    +      }
    +    } yield {
    +      TimeZone.setDefault(timezone)
    +      val timestamp = Timestamp.valueOf(timestampString)
    +      (timestampString, timezone.getID()) -> timestamp.getTime()
    +    }).toMap
    +  } finally {
    +    TimeZone.setDefault(originalTz)
    +  }
    +
    +  private def createRawData(spark: SparkSession): Dataset[(String, Timestamp)] = {
    +    val rowRdd = spark.sparkContext.parallelize(desiredTimestampStrings, 1).map(Row(_))
    +    val schema = StructType(Seq(
    +      StructField("display", StringType, true)
    +    ))
    +    val df = spark.createDataFrame(rowRdd, schema)
    +    // this will get the millis corresponding to the display time given the current *session*
    +    // timezone.
    +    import spark.implicits._
    +    df.withColumn("ts", expr("cast(display as timestamp)")).as[(String, Timestamp)]
    +  }
    +
    +  private def testWriteTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]) : Unit = {
    +    val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +    test(s"SPARK-12297: Write to Parquet tables with Timestamps; explicitTz = $explicitTz; " +
    +        s"sessionTzOpt = $sessionTzOpt") {
    +
    +      withTable(s"saveAsTable_$baseTable", s"insert_$baseTable") {
    +        val sessionTzId = sessionTzOpt.getOrElse(TimeZone.getDefault().getID())
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    +        }.getOrElse("")
    +
    +
    +        val rawData = createRawData(spark)
    +        // Check writing data out.
    +        // We write data into our tables, and then check the raw parquet files to see whether
    +        // the correct conversion was applied.
    +        rawData.write.saveAsTable(s"saveAsTable_$baseTable")
    +        checkHasTz(s"saveAsTable_$baseTable", None)
    +        spark.sql(
    +          raw"""CREATE TABLE insert_$baseTable (
    +                |  display string,
    +                |  ts timestamp
    +                | )
    +                | STORED AS PARQUET
    +                | $tblProperties
    +               """.stripMargin)
    +        checkHasTz(s"insert_$baseTable", explicitTz)
    +        rawData.write.insertInto(s"insert_$baseTable")
    +        // no matter what, roundtripping via the table should leave the data unchanged
    +        val readFromTable = spark.table(s"insert_$baseTable").collect()
    +          .map { row => (row.getAs[String](0), row.getAs[Timestamp](1)).toString() }.sorted
    +        assert(readFromTable === rawData.collect().map(_.toString()).sorted)
    +
    +        // Now we load the raw parquet data on disk, and check if it was adjusted correctly.
    +        // Note that we only store the timezone in the table property, so when we read the
    +        // data this way, we're bypassing all of the conversion logic, and reading the raw
    +        // values in the parquet file.
    +        val onDiskLocation = spark.sessionState.catalog
    +          .getTableMetadata(TableIdentifier(s"insert_$baseTable")).location.getPath
    +        // we test reading the data back with and without the vectorized reader, to make sure we
    +        // haven't broken reading parquet from non-hive tables, with both readers.
    +        Seq(false, true).foreach { vectorized =>
    +          spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, vectorized)
    +          val readFromDisk = spark.read.parquet(onDiskLocation).collect()
    +          val storageTzId = explicitTz.getOrElse(sessionTzId)
    +          readFromDisk.foreach { row =>
    +            val displayTime = row.getAs[String](0)
    +            val millis = row.getAs[Timestamp](1).getTime()
    +            val expectedMillis = timestampTimezoneToMillis((displayTime, storageTzId))
    +            assert(expectedMillis === millis, s"Display time '$displayTime' was stored " +
    +              s"incorrectly with sessionTz = ${sessionTzOpt}; Got $millis, expected " +
    +              s"$expectedMillis (delta = ${millis - expectedMillis})")
    +          }
    +        }
    +      }
    +    }
    +  }
    +
    +  private def testReadTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +    test(s"SPARK-12297: Read from Parquet tables with Timestamps; explicitTz = $explicitTz; " +
    +      s"sessionTzOpt = $sessionTzOpt") {
    +      withTable(s"external_$baseTable") {
    +        // we intentionally save this data directly, without creating a table, so we can
    +        // see that the data is read back differently depending on table properties.
    +        // we'll save with adjusted millis, so that it should be the correct millis after reading
    +        // back.
    +        val rawData = createRawData(spark)
    +        // to avoid closing over entire class
    +        val timestampTimezoneToMillis = this.timestampTimezoneToMillis
    +        import spark.implicits._
    +        val adjustedRawData = (explicitTz match {
    +          case Some(tzId) =>
    +            rawData.map { case (displayTime, _) =>
    +              val storageMillis = timestampTimezoneToMillis((displayTime, tzId))
    +              (displayTime, new Timestamp(storageMillis))
    +            }
    +          case _ =>
    +            rawData
    +        }).withColumnRenamed("_1", "display").withColumnRenamed("_2", "ts")
    +        withTempPath { path =>
    +          adjustedRawData.write.parquet(path.getCanonicalPath)
    +          val options = Map("path" -> path.getCanonicalPath) ++
    +            explicitTz.map { tz => Map(key -> tz) }.getOrElse(Map())
    +
    +          spark.catalog.createTable(
    +            tableName = s"external_$baseTable",
    +            source = "parquet",
    +            schema = new StructType().add("display", StringType).add("ts", TimestampType),
    +            options = options
    +          )
    +          Seq(false, true).foreach { vectorized =>
    +            withClue(s"vectorized = $vectorized;") {
    +              spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, vectorized)
    --- End diff --
    
    I was initially using `SQLTestUtils.withSQLConf`, but I discovered that it wasn't actually taking any effect.  I don't know if that is because `TestHiveSingleton` does something strange, or maybe I'm doing something else weird in this test by creating many new spark sessions.  But I did that because it was the only way I could get the conf changes applied consistently.
    
    Since I am creating new sessions, I don't think this has any risk of a failed test not cleaning up and triggering failures in other tests outside of this suite.  But it still seems like I might be doing something wrong ...
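    
    For illustration, the kind of session-scoped set-and-restore helper this amounts to could look roughly like the sketch below (hypothetical; `withSessionConf` is not part of the suite):
    
    ```scala
    import org.apache.spark.sql.SparkSession

    // Apply a conf to one specific session for the duration of `body`, then restore it.
    def withSessionConf[T](spark: SparkSession, key: String, value: String)(body: => T): T = {
      val old = spark.conf.getOption(key)
      spark.conf.set(key, value)
      try body finally old match {
        case Some(v) => spark.conf.set(key, v)
        case None => spark.conf.unset(key)
      }
    }
    ```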




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112372155
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala ---
    @@ -673,12 +679,35 @@ private[parquet] object ParquetRowConverter {
         unscaled
       }
     
    -  def binaryToSQLTimestamp(binary: Binary): SQLTimestamp = {
    +  /**
    +   * Converts an int96 to a SQLTimestamp, given both the storage timezone and the local timezone.
     +   * The timestamp is really meant to be interpreted as a "floating time", but since we
     +   * actually store it as micros since epoch, that is why we have to apply a conversion when
     +   * timezones change.
     +   *
     +   * @param binary
     +   * @param sessionTz the session timezone.  This is used to determine how to display the time,
     +   *                  and to compute functions on the timestamp which involve a timezone, e.g.
     +   *                  extracting the hour.
     +   * @param storageTz the timezone which was used to store the timestamp.  This should come from
     +   *                  the timestamp table property, or else it is assumed to be the same as the
     +   *                  sessionTz.
    +   * @return
    --- End diff --
    
    Can you also add descriptions for `@param binary` and `@return`?
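    
    For example, the requested descriptions might read something like this (just a sketch of possible wording, based on the int96 layout; the actual wording is up to the author):
    
    ```scala
    /**
     * @param binary the 12-byte int96 value read from parquet: 8 bytes of nanoseconds
     *               within the day, followed by 4 bytes for the julian day
     * @return the timestamp as microseconds since the epoch, adjusted so it displays
     *         correctly in the session timezone
     */
    ```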




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112370857
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -42,6 +52,15 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
            """.stripMargin)
       }
     
    +  override def afterEach(): Unit = {
    --- End diff --
    
    Why do we need this?




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74031/testReport)** for PR 16781 at commit [`2c8a228`](https://github.com/apache/spark/commit/2c8a22811f404c751841e9d9f2e8b22780d60f99).




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Did we conduct any performance tests on this patch?
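    
    (For a rough, ad-hoc check, something like the following could compare scan times on tables with and without the timezone property set; `insert_base` is a stand-in table name, not from this PR:)
    
    ```scala
    // Rough timing probe: SparkSession.time prints the elapsed time of the block.
    spark.time { spark.table("insert_base").selectExpr("max(ts)").collect() }
    ```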





[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74688 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74688/testReport)** for PR 16781 at commit [`38e19cd`](https://github.com/apache/spark/commit/38e19cdd497992fb063cb39d1d65bde1622553e4).




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74609 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74609/testReport)** for PR 16781 at commit [`f4dca27`](https://github.com/apache/spark/commit/f4dca27e301c1ce94e7175c6a852a4e94cea0555).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r114473069
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -17,12 +17,23 @@
     
     package org.apache.spark.sql.hive
     
    +import java.io.File
    +import java.net.URLDecoder
     import java.sql.Timestamp
    +import java.util.TimeZone
     
    -import org.apache.spark.sql.Row
    -import org.apache.spark.sql.execution.datasources.parquet.ParquetCompatibilityTest
    +import org.apache.hadoop.fs.{FileSystem, Path}
    +import org.apache.parquet.hadoop.ParquetFileReader
    +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
    +import org.scalatest.BeforeAndAfterEach
    --- End diff --
    
    nit: unnecessary import.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #75177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75177/testReport)** for PR 16781 at commit [`a96806f`](https://github.com/apache/spark/commit/a96806fa4011e28dcc875d0fb519501013b91a1b).




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #75686 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75686/testReport)** for PR 16781 at commit [`75e8579`](https://github.com/apache/spark/commit/75e8579551c2f1ae7ae958853754fbbc5a589dd4).




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74611 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74611/testReport)** for PR 16781 at commit [`d951443`](https://github.com/apache/spark/commit/d951443d7dddd187fe119f01cb9ee16459f4a346).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #16781: [SPARK-12297][SQL][POC] Hive compatibility for Pa...

Posted by squito <gi...@git.apache.org>.
Github user squito closed the pull request at:

    https://github.com/apache/spark/pull/16781




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112503018
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +160,326 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    Seq(
    +      "UTC" -> "UTC",
    +      "LA" -> "America/Los_Angeles",
    +      "Berlin" -> "Europe/Berlin"
    +    ).foreach { case (tableName, zone) =>
    +      if (zone != localTz) {
    +        testCreateWriteRead(sparkSession, tableName, Some(zone), sessionTzOpt)
    +      }
    +    }
    +  }
    +
    +  private def testCreateWriteRead(
    +      sparkSession: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    testCreateAlterTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testWriteTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testReadTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +  }
    +
    +  private def checkHasTz(table: String, tz: Option[String]): Unit = {
    +    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    +    assert(tableMetadata.properties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) === tz)
    +  }
    +
    +  private def testCreateAlterTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    test(s"SPARK-12297: Create and Alter Parquet tables and timezones; explicitTz = $explicitTz; " +
    +      s"sessionTzOpt = $sessionTzOpt") {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +      withTable(baseTable, s"like_$baseTable", s"select_$baseTable") {
    +        val localTz = TimeZone.getDefault()
    +        val localTzId = localTz.getID()
    +        // If we ever add a property to set the table timezone by default, defaultTz would change
    +        val defaultTz = None
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    +        }.getOrElse("")
    +        spark.sql(
    +          raw"""CREATE TABLE $baseTable (
    +                |  x int
    +                | )
    +                | STORED AS PARQUET
    +                | $tblProperties
    +            """.stripMargin)
    +        val expectedTableTz = explicitTz.orElse(defaultTz)
    +        checkHasTz(baseTable, expectedTableTz)
    +        spark.sql(s"CREATE TABLE like_$baseTable LIKE $baseTable")
    +        checkHasTz(s"like_$baseTable", expectedTableTz)
    +        spark.sql(
    +          raw"""CREATE TABLE select_$baseTable
    +                | STORED AS PARQUET
    +                | AS
    +                | SELECT * from $baseTable
    +            """.stripMargin)
    +        checkHasTz(s"select_$baseTable", defaultTz)
    +
    +        // check alter table, setting, unsetting, resetting the property
    +        spark.sql(
    +          raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="America/Los_Angeles")""")
    +        checkHasTz(baseTable, Some("America/Los_Angeles"))
    +        spark.sql( raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="UTC")""")
    +        checkHasTz(baseTable, Some("UTC"))
    +        spark.sql( raw"""ALTER TABLE $baseTable UNSET TBLPROPERTIES ($key)""")
    +        checkHasTz(baseTable, None)
    +        explicitTz.foreach { tz =>
    +          spark.sql( raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="$tz")""")
    +          checkHasTz(baseTable, expectedTableTz)
    +        }
    +      }
    +    }
    +  }
    +
    +  val desiredTimestampStrings = Seq(
    +    "2015-12-31 22:49:59.123",
    +    "2015-12-31 23:50:59.123",
    +    "2016-01-01 00:39:59.123",
    +    "2016-01-01 01:29:59.123"
    +  )
    +  // We don't want to mess with timezones inside the tests themselves, since we use a shared
    +  // spark context, and then we might be prone to issues from lazy vals for timezones.  Instead,
    +  // we manually adjust the timezone just to determine what the desired millis (since epoch, in utc)
    +  // is for various "wall-clock" times in different timezones, and then we can compare against those
    +  // in our tests.
    +  val originalTz = TimeZone.getDefault
    +  val timestampTimezoneToMillis = try {
    +    (for {
    +      timestampString <- desiredTimestampStrings
    +      timezone <- Seq("America/Los_Angeles", "Europe/Berlin", "UTC").map {
    +        TimeZone.getTimeZone(_)
    +      }
    +    } yield {
    +      TimeZone.setDefault(timezone)
    +      val timestamp = Timestamp.valueOf(timestampString)
    +      (timestampString, timezone.getID()) -> timestamp.getTime()
    +    }).toMap
    +  } finally {
    +    TimeZone.setDefault(originalTz)
    +  }
    +
    +  private def createRawData(spark: SparkSession): Dataset[(String, Timestamp)] = {
    +    val rowRdd = spark.sparkContext.parallelize(desiredTimestampStrings, 1).map(Row(_))
    +    val schema = StructType(Seq(
    +      StructField("display", StringType, true)
    +    ))
    +    val df = spark.createDataFrame(rowRdd, schema)
    --- End diff --
    
    thanks, appreciate the help simplifying this.  I had a feeling it was more complex than it needed to be :)




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r114477668
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +152,373 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    testTimezones.foreach { case (tableName, zone) =>
    +      if (zone != localTz) {
    +        testCreateWriteRead(sparkSession, tableName, Some(zone), sessionTzOpt)
    +      }
    +    }
    +  }
    +
    +  private def testCreateWriteRead(
    +      sparkSession: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    testCreateAlterTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testWriteTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testReadTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +  }
    +
    +  private def checkHasTz(spark: SparkSession, table: String, tz: Option[String]): Unit = {
    +    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    +    assert(tableMetadata.properties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) === tz)
    +  }
    +
    +  private def testCreateAlterTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    test(s"SPARK-12297: Create and Alter Parquet tables and timezones; explicitTz = $explicitTz; " +
    +      s"sessionTzOpt = $sessionTzOpt") {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +      withTable(baseTable, s"like_$baseTable", s"select_$baseTable", s"partitioned_$baseTable") {
    +        // If we ever add a property to set the table timezone by default, defaultTz would change
    +        val defaultTz = None
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    +        }.getOrElse("")
    +        spark.sql(
    +          raw"""CREATE TABLE $baseTable (
    +                |  x int
    +                | )
    +                | STORED AS PARQUET
    +                | $tblProperties
    +            """.stripMargin)
    +        val expectedTableTz = explicitTz.orElse(defaultTz)
    +        checkHasTz(spark, baseTable, expectedTableTz)
    +        spark.sql(
    +          raw"""CREATE TABLE partitioned_$baseTable (
    +                |  x int
    +                | )
    +                | PARTITIONED BY (y int)
    +                | STORED AS PARQUET
    +                | $tblProperties
    +            """.stripMargin)
    +        checkHasTz(spark, s"partitioned_$baseTable", expectedTableTz)
    +        spark.sql(s"CREATE TABLE like_$baseTable LIKE $baseTable")
    +        checkHasTz(spark, s"like_$baseTable", expectedTableTz)
    +        spark.sql(
    +          raw"""CREATE TABLE select_$baseTable
    +                | STORED AS PARQUET
    +                | AS
    +                | SELECT * from $baseTable
    +            """.stripMargin)
    +        checkHasTz(spark, s"select_$baseTable", defaultTz)
    +
    +        // check alter table, setting, unsetting, resetting the property
    +        spark.sql(
    +          raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="America/Los_Angeles")""")
    +        checkHasTz(spark, baseTable, Some("America/Los_Angeles"))
    +        spark.sql(raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="UTC")""")
    +        checkHasTz(spark, baseTable, Some("UTC"))
    +        spark.sql(raw"""ALTER TABLE $baseTable UNSET TBLPROPERTIES ($key)""")
    +        checkHasTz(spark, baseTable, None)
    +        explicitTz.foreach { tz =>
    +          spark.sql(raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="$tz")""")
    +          checkHasTz(spark, baseTable, expectedTableTz)
    +        }
    +      }
    +    }
    +  }
    +
    +  val desiredTimestampStrings = Seq(
    +    "2015-12-31 22:49:59.123",
    +    "2015-12-31 23:50:59.123",
    +    "2016-01-01 00:39:59.123",
    +    "2016-01-01 01:29:59.123"
    +  )
    +  // We don't want to mess with timezones inside the tests themselves, since we use a shared
    +  // spark context, and then we might be prone to issues from lazy vals for timezones.  Instead,
    +  // we manually adjust the timezone just to determine what the desired millis (since epoch, in utc)
    +  // is for various "wall-clock" times in different timezones, and then we can compare against those
    +  // in our tests.
    +  val timestampTimezoneToMillis = {
    +    val originalTz = TimeZone.getDefault
    +    try {
    +      (for {
    --- End diff --
    
    Let's use `flatMap { .. map { ... } }`.
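    
    i.e., roughly (same logic as the for-comprehension above):
    
    ```scala
    desiredTimestampStrings.flatMap { timestampString =>
      Seq("America/Los_Angeles", "Europe/Berlin", "UTC").map(TimeZone.getTimeZone(_)).map { timezone =>
        TimeZone.setDefault(timezone)
        val timestamp = Timestamp.valueOf(timestampString)
        (timestampString, timezone.getID()) -> timestamp.getTime()
      }
    }.toMap
    ```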




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74688 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74688/testReport)** for PR 16781 at commit [`38e19cd`](https://github.com/apache/spark/commit/38e19cdd497992fb063cb39d1d65bde1622553e4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    pinging some potential reviewers:
    @liancheng @yhuai for prior work in hive compatibility
    @ueshin for work on timezone support
    
    Note that the timezone support from SPARK-18350 is to some extent at odds with this.  It's not a total conflict, because that work addresses how Spark stores data in its own json & csv formats (among others), while this is about making sure that *hive* tables are consistent between different formats.  I am somewhat puzzled by those changes, as they seem to change the meaning of the "TIMESTAMP" datatype to be more like "TIMESTAMP WITH TIMEZONE" in other sql engines, but I'll also admit I'm out of my expertise here.
    
    This is not the area of code I am most familiar with, so I'm happy to get feedback.  One thing in particular I haven't addressed is predicate pushdown -- in particular, I'm not sure how to *test* predicate pushdown.  I can poke at that some more, but I'd appreciate any guidance.
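    
    (For reference, one rough way to probe pushdown -- a hypothetical sketch, with `insert_base` as a stand-in table name -- is to look for the pushed predicate in the parquet scan's plan description and then sanity-check the rows:)
    
    ```scala
    val df = spark.sql("SELECT * FROM insert_base WHERE ts > '2016-01-01 00:00:00'")
    // FileSourceScanExec lists pushed-down predicates in its plan output,
    // e.g. "PushedFilters: [...]".
    assert(df.queryExecution.executedPlan.toString.contains("PushedFilters"))
    df.collect()  // the returned rows must still be correct under the conversion
    ```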




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74459 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74459/testReport)** for PR 16781 at commit [`c242fb8`](https://github.com/apache/spark/commit/c242fb89e2861c81c44877a3acfded9073b9104b).




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #75101 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75101/testReport)** for PR 16781 at commit [`39f506c`](https://github.com/apache/spark/commit/39f506c81187b466809ebca21b7d1bc4c1e82f7a).




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #76141 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76141/testReport)** for PR 16781 at commit [`e31657a`](https://github.com/apache/spark/commit/e31657a1b65ab73d52651c4e0b018d457e44d47f).




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112366731
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +160,326 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    Seq(
    +      "UTC" -> "UTC",
    +      "LA" -> "America/Los_Angeles",
    +      "Berlin" -> "Europe/Berlin"
    +    ).foreach { case (tableName, zone) =>
    +      if (zone != localTz) {
    +        testCreateWriteRead(sparkSession, tableName, Some(zone), sessionTzOpt)
    +      }
    +    }
    +  }
    +
    +  private def testCreateWriteRead(
    +      sparkSession: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    testCreateAlterTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testWriteTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testReadTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +  }
    +
    +  private def checkHasTz(table: String, tz: Option[String]): Unit = {
    +    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    +    assert(tableMetadata.properties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) === tz)
    +  }
    +
    +  private def testCreateAlterTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    test(s"SPARK-12297: Create and Alter Parquet tables and timezones; explicitTz = $explicitTz; " +
    +      s"sessionTzOpt = $sessionTzOpt") {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +      withTable(baseTable, s"like_$baseTable", s"select_$baseTable") {
    +        val localTz = TimeZone.getDefault()
    +        val localTzId = localTz.getID()
    +        // If we ever add a property to set the table timezone by default, defaultTz would change
    +        val defaultTz = None
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    +        }.getOrElse("")
    +        spark.sql(
    +          raw"""CREATE TABLE $baseTable (
    +                |  x int
    +                | )
    +                | STORED AS PARQUET
    +                | $tblProperties
    +            """.stripMargin)
    +        val expectedTableTz = explicitTz.orElse(defaultTz)
    +        checkHasTz(baseTable, expectedTableTz)
    +        spark.sql(s"CREATE TABLE like_$baseTable LIKE $baseTable")
    +        checkHasTz(s"like_$baseTable", expectedTableTz)
    +        spark.sql(
    +          raw"""CREATE TABLE select_$baseTable
    +                | STORED AS PARQUET
    +                | AS
    +                | SELECT * from $baseTable
    +            """.stripMargin)
    +        checkHasTz(s"select_$baseTable", defaultTz)
    +
    +        // check alter table, setting, unsetting, resetting the property
    +        spark.sql(
    +          raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="America/Los_Angeles")""")
    +        checkHasTz(baseTable, Some("America/Los_Angeles"))
    +        spark.sql( raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="UTC")""")
    +        checkHasTz(baseTable, Some("UTC"))
    +        spark.sql( raw"""ALTER TABLE $baseTable UNSET TBLPROPERTIES ($key)""")
    +        checkHasTz(baseTable, None)
    +        explicitTz.foreach { tz =>
    +          spark.sql( raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="$tz")""")
    +          checkHasTz(baseTable, expectedTableTz)
    +        }
    +      }
    +    }
    +  }
    +
    +  val desiredTimestampStrings = Seq(
    +    "2015-12-31 22:49:59.123",
    +    "2015-12-31 23:50:59.123",
    +    "2016-01-01 00:39:59.123",
    +    "2016-01-01 01:29:59.123"
    +  )
    +  // We don't want to mess with timezones inside the tests themselves, since we use a shared
    +  // spark context, and then we might be prone to issues from lazy vals for timezones.  Instead,
    +  // we manually adjust the timezone just to determine what the desired millis (since epoch, in utc)
    +  // is for various "wall-clock" times in different timezones, and then we can compare against those
    +  // in our tests.
    +  val originalTz = TimeZone.getDefault
    +  val timestampTimezoneToMillis = try {
    --- End diff --
    
    Shall we initialize this in the block like this?
    
    ```scala
    val timestampTimezoneToMillis = {
      val originalTz = TimeZone.getDefault
      try {
        ...
      } finally {
        TimeZone.setDefault(originalTz)
      }
    }
    ```





[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    @ueshin I've pushed an update which addresses your comments.  I also realized that partitioned tables weren't handled correctly!  I fixed that as well.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Do you mean the default `SQLConf.PARQUET_TABLE_INCLUDE_TIMEZONE` should be `false` for now?
    If so, I agree with it.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75966/
    Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76556/
    Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74031 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74031/testReport)** for PR 16781 at commit [`2c8a228`](https://github.com/apache/spark/commit/2c8a22811f404c751841e9d9f2e8b22780d60f99).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #16781: [SPARK-12297][SQL][POC] Hive compatibility for Pa...

Posted by squito <gi...@git.apache.org>.
GitHub user squito reopened a pull request:

    https://github.com/apache/spark/pull/16781

    [SPARK-12297][SQL][POC] Hive compatibility for Parquet Timestamps

    ## What changes were proposed in this pull request?
    
    Hive has very strange behavior when writing timestamps to parquet data.  It will always apply the conversion from the local timezone to UTC, which it then reverses when reading the data back.  For compatibility with Hive, Spark should provide an option for doing the same conversion, when necessary, based on table metadata.  This goes along with HIVE-12767.
    
    Note that the default for Spark remains unchanged; created tables are marked as UTC, which means the read and write path remains unchanged (and avoids slow timezone logic).  The major use case is that *legacy* tables written by hive can now be read in correctly by Spark, as long as the appropriate table properties are set.
    
    ## How was this patch tested?
    
    Added a unit test which creates tables, reads and writes data, under a variety of permutations (conf on whether or not to add a default tz to new tables, different explicit timezones, vectorized reading on and off).
    
    TODO
    * [ ] predicate pushdown

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark SPARK-12297

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16781.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16781
    
----
commit 53d0744f8cdb9404bfe84f1e0154606d3442639c
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-01-27T02:37:27Z

    very basic test for adjusting read parquet data

commit 69a3c8cb6c4efb35d817c214a43b217daddade4b
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-01-27T20:18:17Z

    wip

commit 51e24f28359b807f46e93975941330d5d93e3875
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-01-31T02:55:09Z

    working version for non-vectorized read -- lots of garbage too

commit 7e618411c83002ca098526b0691f8b184295a216
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-01-31T19:03:34Z

    working for vectorized reads -- not sure about all code paths

commit 9fbde13cbc431ff955564a0695c7a1c3e64e158f
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-01T15:39:39Z

    more tests for write path

commit bac9eb0ed3a65fcdaab458ef3bd52aef5af01b68
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-01T20:23:07Z

    expand tests; fix some metastore interaction; cleanup a lot of garbage

commit 1b05978dc9ee1de5f6d1d7031510ab6b91a6e5b9
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-01T20:49:31Z

    more cleanup

commit b622d278d7a451846dcde28ed01c9618b7a00662
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-01T22:03:38Z

    handle bad timezones; include unit test

commit 0604403e0d67d59c9c586b72b340db5c157d817b
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-02T05:43:08Z

    write support; lots more unit tests

commit f45516da3ee5adf6300085a807b7acd4193cbb36
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-02T16:17:39Z

    add tests for alter table

commit d4511a68a881c0f2b1238d644e4e6fa1f5578154
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-02T16:25:52Z

    utc or gmt; cleanup

commit 223ce2c25b122707c64e4eda77a11bff71fd0cbe
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-02T16:27:13Z

    more cleanup

commit 5b49ae026044b46f0899a9e792e2b71733c4cb8a
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-02T18:02:31Z

    fix compatibility

commit 9ef60a4fdb0164f6eed75d40897fddce33f96c23
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-01T19:31:54Z

    Merge branch 'master' into SPARK-12297

commit 0b6883c944ed0400a62414ebd605d7114a2f135d
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-02T20:41:46Z

    Merge branch 'master' into SPARK-12297

commit 69b81425c9b68240ddc5411a83ba39ca7d1a74e3
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-03T03:05:06Z

    wip

commit 7ca2c864de3b1a34e2e77e72f9ae51cfada88d65
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-03T22:33:42Z

    fix

commit 6f982d30c7cd4f8ee6e28024c45dcaeaa72bd874
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-06T19:12:00Z

    fixes; passes tests now

commit 1ad2f8302ee0c9f1caa801b5a75ddf610930e299
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-06T19:14:55Z

    Merge branch 'master' into SPARK-12297

commit 2c8a22811f404c751841e9d9f2e8b22780d60f99
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-06T19:22:11Z

    fix merge

----




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112048338
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala ---
    @@ -261,18 +266,20 @@ private[parquet] class ParquetRowConverter(
     
           case TimestampType =>
             // TODO Implements `TIMESTAMP_MICROS` once parquet-mr has that.
    +        // If the table has a timezone property, apply the correct conversions.  See SPARK-12297.
    +        val sessionTz = TimeZone.getTimeZone(hadoopConf.get(SQLConf.SESSION_LOCAL_TIMEZONE.key))
    --- End diff --
    
    good catch -- this would fail if you bypassed a hive table and read directly with `spark.read.parquet()`.  I'll be sure to add a test case to cover it as well.
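    
    A minimal sketch of the kind of direct-read check that implies (the table name and path are made up for illustration; `spark` is the session, as in the tests above):
    
    ```scala
    // Illustrative only: read the same files with and without the table metadata.
    // A direct file read carries no table property, so no timezone adjustment
    // applies; reading through the Hive table should apply the conversion.
    val direct = spark.read.parquet("/warehouse/tz_table")   // hypothetical path
    val viaTable = spark.sql("SELECT * FROM tz_table")       // hypothetical table
    // `direct` should show the raw on-disk values, `viaTable` the adjusted ones.
    ```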




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74542 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74542/testReport)** for PR 16781 at commit [`c87a573`](https://github.com/apache/spark/commit/c87a57323ac1264ed4528a8cd2e26d342ca117f6).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #76556 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76556/testReport)** for PR 16781 at commit [`2537437`](https://github.com/apache/spark/commit/2537437a0c1f022acc215518e4728d24f6f6cf97).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #75686 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75686/testReport)** for PR 16781 at commit [`75e8579`](https://github.com/apache/spark/commit/75e8579551c2f1ae7ae958853754fbbc5a589dd4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112144554
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -498,6 +498,11 @@ object DateTimeUtils {
         false
       }
     
    +  lazy val validTimezones = TimeZone.getAvailableIDs().toSet
    +  def isValidTimezone(timezoneId: String): Boolean = {
    +    validTimezones.contains(timezoneId)
    --- End diff --
    
    Ah, I see. Let's keep current work.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    > Do you mean the default SQLConf.PARQUET_TABLE_INCLUDE_TIMEZONE should be false for now?
    > If so, I agree with it.
    
    I was proposing something more drastic -- eliminating that conf entirely.  If you want your table to have the timezone property set, then rather than using that conf, you include `TBLPROPERTIES ("parquet.mr.int96.write.zone"="[some timezone]")` in your DDL (either "create table" or "alter table").
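    
    For example (a sketch; `legacy_ts` is a made-up table name, and the key is the value of the `PARQUET_TIMEZONE_TABLE_PROPERTY` constant from the patch):
    
    ```scala
    // Set the property at creation time:
    spark.sql("""
      CREATE TABLE legacy_ts (x INT, ts TIMESTAMP)
      STORED AS PARQUET
      TBLPROPERTIES ("parquet.mr.int96.write.zone"="America/Los_Angeles")
    """)
    // or retrofit it onto an existing table written by hive:
    spark.sql("""ALTER TABLE legacy_ts
      SET TBLPROPERTIES ("parquet.mr.int96.write.zone"="America/Los_Angeles")""")
    ```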




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74042/
    Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    LGTM, pending Jenkins.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #75966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75966/testReport)** for PR 16781 at commit [`44a8bbb`](https://github.com/apache/spark/commit/44a8bbb17484a61dd984fcb451e3b1be8c539e9f).




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74542 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74542/testReport)** for PR 16781 at commit [`c87a573`](https://github.com/apache/spark/commit/c87a57323ac1264ed4528a8cd2e26d342ca117f6).




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r111891150
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala ---
    @@ -261,18 +266,20 @@ private[parquet] class ParquetRowConverter(
     
           case TimestampType =>
             // TODO Implements `TIMESTAMP_MICROS` once parquet-mr has that.
    +        // If the table has a timezone property, apply the correct conversions.  See SPARK-12297.
    +        val sessionTz = TimeZone.getTimeZone(hadoopConf.get(SQLConf.SESSION_LOCAL_TIMEZONE.key))
    --- End diff --
    
    Is this safe without checking `null`?
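    
    If not, a null-safe lookup along these lines might be needed (just a sketch; the fallback to the JVM default zone is an assumption, not necessarily what the patch should do):
    
    ```scala
    // Sketch: tolerate a missing conf entry instead of passing null to getTimeZone.
    val sessionTz = Option(hadoopConf.get(SQLConf.SESSION_LOCAL_TIMEZONE.key))
      .map(id => TimeZone.getTimeZone(id))
      .getOrElse(TimeZone.getDefault)  // assumed fallback when the key is absent
    ```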




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112366489
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +160,326 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    Seq(
    +      "UTC" -> "UTC",
    +      "LA" -> "America/Los_Angeles",
    +      "Berlin" -> "Europe/Berlin"
    +    ).foreach { case (tableName, zone) =>
    +      if (zone != localTz) {
    +        testCreateWriteRead(sparkSession, tableName, Some(zone), sessionTzOpt)
    +      }
    +    }
    +  }
    +
    +  private def testCreateWriteRead(
    +      sparkSession: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    testCreateAlterTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testWriteTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testReadTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +  }
    +
    +  private def checkHasTz(table: String, tz: Option[String]): Unit = {
    +    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    +    assert(tableMetadata.properties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) === tz)
    +  }
    +
    +  private def testCreateAlterTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    test(s"SPARK-12297: Create and Alter Parquet tables and timezones; explicitTz = $explicitTz; " +
    +      s"sessionTzOpt = $sessionTzOpt") {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +      withTable(baseTable, s"like_$baseTable", s"select_$baseTable") {
    +        val localTz = TimeZone.getDefault()
    +        val localTzId = localTz.getID()
    +        // If we ever add a property to set the table timezone by default, defaultTz would change
    +        val defaultTz = None
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    +        }.getOrElse("")
    +        spark.sql(
    +          raw"""CREATE TABLE $baseTable (
    +                |  x int
    +                | )
    +                | STORED AS PARQUET
    +                | $tblProperties
    +            """.stripMargin)
    +        val expectedTableTz = explicitTz.orElse(defaultTz)
    +        checkHasTz(baseTable, expectedTableTz)
    +        spark.sql(s"CREATE TABLE like_$baseTable LIKE $baseTable")
    +        checkHasTz(s"like_$baseTable", expectedTableTz)
    +        spark.sql(
    +          raw"""CREATE TABLE select_$baseTable
    +                | STORED AS PARQUET
    +                | AS
    +                | SELECT * from $baseTable
    +            """.stripMargin)
    +        checkHasTz(s"select_$baseTable", defaultTz)
    +
    +        // check alter table, setting, unsetting, resetting the property
    +        spark.sql(
    +          raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="America/Los_Angeles")""")
    +        checkHasTz(baseTable, Some("America/Los_Angeles"))
    +        spark.sql( raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="UTC")""")
    +        checkHasTz(baseTable, Some("UTC"))
    +        spark.sql( raw"""ALTER TABLE $baseTable UNSET TBLPROPERTIES ($key)""")
    +        checkHasTz(baseTable, None)
    +        explicitTz.foreach { tz =>
    +          spark.sql( raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="$tz")""")
    +          checkHasTz(baseTable, expectedTableTz)
    +        }
    +      }
    +    }
    +  }
    +
    +  val desiredTimestampStrings = Seq(
    +    "2015-12-31 22:49:59.123",
    +    "2015-12-31 23:50:59.123",
    +    "2016-01-01 00:39:59.123",
    +    "2016-01-01 01:29:59.123"
    +  )
    +  // We don't want to mess with timezones inside the tests themselves, since we use a shared
    +  // spark context, and then we might be prone to issues from lazy vals for timezones.  Instead,
    +  // we manually adjust the timezone just to determine what the desired millis (since epoch, in utc)
    +  // is for various "wall-clock" times in different timezones, and then we can compare against those
    +  // in our tests.
    +  val originalTz = TimeZone.getDefault
    +  val timestampTimezoneToMillis = try {
    +    (for {
    +      timestampString <- desiredTimestampStrings
    +      timezone <- Seq("America/Los_Angeles", "Europe/Berlin", "UTC").map {
    +        TimeZone.getTimeZone(_)
    +      }
    +    } yield {
    +      TimeZone.setDefault(timezone)
    +      val timestamp = Timestamp.valueOf(timestampString)
    +      (timestampString, timezone.getID()) -> timestamp.getTime()
    +    }).toMap
    +  } finally {
    +    TimeZone.setDefault(originalTz)
    +  }
    +
    +  private def createRawData(spark: SparkSession): Dataset[(String, Timestamp)] = {
    +    val rowRdd = spark.sparkContext.parallelize(desiredTimestampStrings, 1).map(Row(_))
    +    val schema = StructType(Seq(
    +      StructField("display", StringType, true)
    +    ))
    +    val df = spark.createDataFrame(rowRdd, schema)
    --- End diff --
    
    We can use `val df = desiredTimestampStrings.toDF("display")` after `import spark.implicits._`.
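    
    i.e., roughly (sketch):
    
    ```scala
    import spark.implicits._
    // Shorter construction of the same single-column DataFrame; the original's
    // single partition can be restored with .coalesce(1) if the test needs it.
    val df = desiredTimestampStrings.toDF("display")
    ```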




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Jenkins, retest this please.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #76419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76419/testReport)** for PR 16781 at commit [`2537437`](https://github.com/apache/spark/commit/2537437a0c1f022acc215518e4728d24f6f6cf97).




[GitHub] spark pull request #16781: [SPARK-12297][SQL][POC] Hive compatibility for Pa...

Posted by zivanfi <gi...@git.apache.org>.
Github user zivanfi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r104660877
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java ---
    @@ -89,11 +92,23 @@
     
       private final PageReader pageReader;
       private final ColumnDescriptor descriptor;
    +  private final TimeZone storageTz;
    +  private final TimeZone localTz = TimeZone.getDefault();
     
    -  public VectorizedColumnReader(ColumnDescriptor descriptor, PageReader pageReader)
    +  public VectorizedColumnReader(ColumnDescriptor descriptor, PageReader pageReader,
    +                                Configuration conf)
           throws IOException {
         this.descriptor = descriptor;
         this.pageReader = pageReader;
    +    // If the table has a timezone property, apply the correct conversions.  See SPARK-12297.
    +    // The conf is sometimes null in tests.
    +    String tzString =
    +        conf == null ? null : conf.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY());
    +    if (tzString == null || tzString == "") {
    --- End diff --
    
    This is java code, not scala, you probably meant tzString.equals("") instead of tzString == ""




[GitHub] spark pull request #16781: [SPARK-12297][SQL][POC] Hive compatibility for Pa...

Posted by zivanfi <gi...@git.apache.org>.
Github user zivanfi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r104664187
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -674,6 +674,12 @@ object SQLConf {
           .stringConf
           .createWithDefault(TimeZone.getDefault().getID())
     
    +  val PARQUET_TABLE_INCLUDE_TIMEZONE =
    +    buildConf("spark.sql.session.parquet.timeZone")
    +      .doc("""Enables inclusion of parquet timezone property in newly created parquet tables""")
    --- End diff --
    
    There should be a config option for writing "UTC" to the table property when creating tables, not for writing the local timezone.
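    
    For illustration, a sketch of such a variant -- the key name and doc string here are assumptions, not part of the patch:
    
    ```scala
    // Hypothetical alternative: opt in to stamping new tables with UTC,
    // rather than recording the local timezone.
    val PARQUET_TABLE_INCLUDE_TIMEZONE =
      buildConf("spark.sql.parquet.writeUtcTimezoneProperty")  // assumed key
        .doc("When true, newly created parquet tables record UTC as their storage timezone.")
        .booleanConf
        .createWithDefault(false)
    ```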




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #76419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76419/testReport)** for PR 16781 at commit [`2537437`](https://github.com/apache/spark/commit/2537437a0c1f022acc215518e4728d24f6f6cf97).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #75177 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75177/testReport)** for PR 16781 at commit [`a96806f`](https://github.com/apache/spark/commit/a96806fa4011e28dcc875d0fb519501013b91a1b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #16781: [SPARK-12297][SQL][POC] Hive compatibility for Pa...

Posted by zivanfi <gi...@git.apache.org>.
Github user zivanfi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r104668320
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java ---
    @@ -89,11 +92,23 @@
     
       private final PageReader pageReader;
       private final ColumnDescriptor descriptor;
    +  private final TimeZone storageTz;
    +  private final TimeZone localTz = TimeZone.getDefault();
     
    -  public VectorizedColumnReader(ColumnDescriptor descriptor, PageReader pageReader)
    +  public VectorizedColumnReader(ColumnDescriptor descriptor, PageReader pageReader,
    +                                Configuration conf)
           throws IOException {
         this.descriptor = descriptor;
         this.pageReader = pageReader;
    +    // If the table has a timezone property, apply the correct conversions.  See SPARK-12297.
    +    // The conf is sometimes null in tests.
    +    String tzString =
    +        conf == null ? null : conf.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY());
    +    if (tzString == null || tzString == "") {
    --- End diff --
    
    Or even better, isEmpty().




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Thanks! Merging to master.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r111890989
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -498,6 +498,11 @@ object DateTimeUtils {
         false
       }
     
    +  lazy val validTimezones = TimeZone.getAvailableIDs().toSet
    +  def isValidTimezone(timezoneId: String): Boolean = {
    +    validTimezones.contains(timezoneId)
    --- End diff --
    
    Should be case insensitive?




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75177/
    Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74550/testReport)** for PR 16781 at commit [`db7e514`](https://github.com/apache/spark/commit/db7e5147c09d9f773e630982d02709ca06b3b638).




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74459/
    Test FAILed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    @ueshin sorry it took me a while to figure out how tables partitioned by timestamps work (I didn't even realize that was possible; I don't think it is in hive?) and I was traveling.
    
    The good news is that partitioning by timestamp works just fine.  Since the ts is stored as a string anyway, and [converted using the session tz already](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L135), it already works.  I added one minimal test on this -- when the partitioned table is written, the correct partition dirs are created regardless of the timezone combinations.
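    
    Roughly (a sketch; `df` is assumed to have a timestamp column `ts`, and the exact directory-name escaping is incidental):
    
    ```scala
    // Partition values round-trip through their string form, so the session time
    // zone (not the parquet storage adjustment) governs the rendered value.
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    df.write.partitionBy("ts").format("parquet").saveAsTable("partitioned_ts")
    // produces directories named from the timestamp's string form, independent
    // of any parquet.mr.int96.write.zone table property
    ```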
    
    In particular, it doesn't make sense to do tests like the existing ones, where we write or read "unadjusted" data, bypassing the hive tables, and then make sure the right adjustments are applied when you perform the reverse action via the hive table; the partition values are correct whether you use the hive table & adjustment property or not.
    
    Let me know if you think more tests are required.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    thanks @ueshin ... I am going to chat w/ some folks involved in that hive patch; that was not my understanding, conceptually, of their patch.  I heard that there is a bug they need to fix, so maybe it's related.  I'll find out and update here appropriately.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r111891280
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -192,6 +194,15 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
               fileFormatClass,
               None)
             val logicalRelation = cached.getOrElse {
    +          // We add the timezone to the relation options, which automatically gets injected into
    +          // the hadoopConf for the Parquet Converters
    +          val storageTzKey = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +          val storageTz = relation.tableMeta.properties.getOrElse(storageTzKey, "")
    +          val sessionTz = sparkSession.sessionState.conf.sessionLocalTimeZone
    +          val extraTzOptions = Map(
    +            storageTzKey -> storageTz,
    +            SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz
    --- End diff --
    
    Do we need to store session timezone?




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112365371
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -233,6 +224,17 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
         result.copy(output = newOutput)
       }
     
    +  private def getStorageTzOptions(relation: CatalogRelation): Map[String, String] = {
    +    // We add the table timezone to the relation options, which automatically gets injected into the
    +    // hadoopConf for the Parquet Converters
    +    val storageTzKey = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +    val storageTz = relation.tableMeta.properties.getOrElse(storageTzKey, "")
    +    val sessionTz = sparkSession.sessionState.conf.sessionLocalTimeZone
    +    Map(
    +      storageTzKey -> storageTz
    +    )
    --- End diff --
    
    Should return `Map.empty` if the value isn't included to the table properties?
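    
    i.e., something like this (a sketch of the suggested change, using the names from the diff above):
    
    ```scala
    private def getStorageTzOptions(relation: CatalogRelation): Map[String, String] = {
      // Only propagate the option when the table actually carries the property,
      // so downstream readers can distinguish "unset" from "empty".
      relation.tableMeta.properties
        .get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY)
        .map(tz => Map(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY -> tz))
        .getOrElse(Map.empty)
    }
    ```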




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    @ueshin I've updated this to remove the conf entirely.  So updating your previous description, the new behavior can be described as:
    
    when creating table:
    
    * if a property `PARQUET_TIMEZONE_TABLE_PROPERTY` exists
        * include the table property `PARQUET_TIMEZONE_TABLE_PROPERTY`
        * use the `PARQUET_TIMEZONE_TABLE_PROPERTY` value
    * else
        * don't include table property `PARQUET_TIMEZONE_TABLE_PROPERTY`
    
    when writing/reading data:
    
    * if a table property `PARQUET_TIMEZONE_TABLE_PROPERTY` exists
        * use the `PARQUET_TIMEZONE_TABLE_PROPERTY` value to adjust timezone
    * else
        * don't adjust timezone
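    
    A condensed sketch of that read-path decision (`tableProperties`, `rawMicros`, and `sessionTz` are stand-ins, not the exact names in the patch):
    
    ```scala
    // Illustrative only: adjust INT96 timestamps iff the table property is present,
    // converting between the table's storage zone and the session zone.
    val adjusted = tableProperties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) match {
      case Some(tz) => DateTimeUtils.convertTz(rawMicros, TimeZone.getTimeZone(tz), sessionTz)
      case None     => rawMicros  // no property: leave values untouched
    }
    ```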




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    @ueshin thanks for taking a look.  Yes, I think your summary is exactly correct.
    
    Your suggestion of using the session timezone makes a lot of sense.  I can update the pr accordingly.  But I had another thought I wanted to run by you first -- perhaps in the initial version, we should *never* set the table property automatically?  Just leave it to users to set the property explicitly (in "create table" or "alter table")?




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72287/
    Test FAILed.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112042439
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -498,6 +498,11 @@ object DateTimeUtils {
         false
       }
     
    +  lazy val validTimezones = TimeZone.getAvailableIDs().toSet
    +  def isValidTimezone(timezoneId: String): Boolean = {
    +    validTimezones.contains(timezoneId)
    --- End diff --
    
    Java does a case-sensitive check, which means Hive does too.  I don't think we want to write out a timezone w/ the wrong capitalization, and then have another tool throw an error.
    
    ```scala
    scala> val tzId = "America/Los_Angeles"
    tzId: String = America/Los_Angeles
    
    scala> java.util.TimeZone.getTimeZone(tzId).getID()
    res1: String = America/Los_Angeles
    
    scala> java.util.TimeZone.getTimeZone(tzId.toLowerCase()).getID()
    res2: String = GMT
    ```
    
    https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java#L167
    
    We could try to auto-convert the user's timezone to the correct capitalization, but do you think that is worth it?
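    
    If we did, a minimal sketch could look like this (hypothetical helper, not part of the pr):
    
    ```scala
    import java.util.TimeZone
    
    // Canonicalize a timezone ID case-insensitively against the JDK's list, so that
    // e.g. "america/los_angeles" maps back to "America/Los_Angeles".
    lazy val canonicalTzIds: Map[String, String] =
      TimeZone.getAvailableIDs().map(id => id.toLowerCase -> id).toMap
    
    def canonicalizeTimezone(tzId: String): Option[String] =
      canonicalTzIds.get(tzId.toLowerCase)
    ```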




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76141/
    Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74688/
    Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74031/
    Test FAILed.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by zivanfi <gi...@git.apache.org>.
Github user zivanfi commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Please update the pull request description, because the one dated Feb 2 does not correspond to the fix any more.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76391/
    Test PASSed.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r114477132
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +152,373 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    testTimezones.foreach { case (tableName, zone) =>
    +      if (zone != localTz) {
    +        testCreateWriteRead(sparkSession, tableName, Some(zone), sessionTzOpt)
    +      }
    +    }
    +  }
    +
    +  private def testCreateWriteRead(
    +      sparkSession: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    testCreateAlterTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testWriteTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testReadTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +  }
    +
    +  private def checkHasTz(spark: SparkSession, table: String, tz: Option[String]): Unit = {
    +    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    +    assert(tableMetadata.properties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) === tz)
    +  }
    +
    +  private def testCreateAlterTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    test(s"SPARK-12297: Create and Alter Parquet tables and timezones; explicitTz = $explicitTz; " +
    +      s"sessionTzOpt = $sessionTzOpt") {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +      withTable(baseTable, s"like_$baseTable", s"select_$baseTable", s"partitioned_$baseTable") {
    +        // If we ever add a property to set the table timezone by default, defaultTz would change
    +        val defaultTz = None
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    --- End diff --
    
    Let's use `s""` instead of `raw""` if possible, and do the same elsewhere.
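    
    For reference, these strings contain no escape sequences, so the two interpolators produce identical output here (key value assumed from the examples in this thread):
    
    ```scala
    val key = "parquet.mr.int96.write.zone"
    val tz = "UTC"
    assert(s"""TBLPROPERTIES ($key="$tz")""" == raw"""TBLPROPERTIES ($key="$tz")""")
    ```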




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #72287 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72287/testReport)** for PR 16781 at commit [`223ce2c`](https://github.com/apache/spark/commit/223ce2c25b122707c64e4eda77a11bff71fd0cbe).




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72288/
    Test FAILed.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112365163
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -233,6 +224,17 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
         result.copy(output = newOutput)
       }
     
    +  private def getStorageTzOptions(relation: CatalogRelation): Map[String, String] = {
    +    // We add the table timezone to the relation options, which automatically gets injected into the
    +    // hadoopConf for the Parquet Converters
    +    val storageTzKey = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +    val storageTz = relation.tableMeta.properties.getOrElse(storageTzKey, "")
    +    val sessionTz = sparkSession.sessionState.conf.sessionLocalTimeZone
    --- End diff --
    
    `sessionTz` isn't used.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76419/
    Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74611 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74611/testReport)** for PR 16781 at commit [`d951443`](https://github.com/apache/spark/commit/d951443d7dddd187fe119f01cb9ee16459f4a346).




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #75101 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75101/testReport)** for PR 16781 at commit [`39f506c`](https://github.com/apache/spark/commit/39f506c81187b466809ebca21b7d1bc4c1e82f7a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r111891389
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +160,325 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    Seq(
    +      "UTC" -> "UTC",
    +      "LA" -> "America/Los_Angeles",
    +      "Berlin" -> "Europe/Berlin"
    +    ).foreach { case (tableName, zone) =>
    +      if (zone != localTz) {
    +        testCreateWriteRead(sparkSession, tableName, Some(zone), sessionTzOpt)
    +      }
    +    }
    +  }
    +
    +  private def testCreateWriteRead(
    +      sparkSession: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    testCreateAlterTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testWriteTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testReadTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +  }
    +
    +  private def checkHasTz(table: String, tz: Option[String]): Unit = {
    +    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    +    assert(tableMetadata.properties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) === tz)
    +  }
    +
    +  private def testCreateAlterTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    test(s"SPARK-12297: Create and Alter Parquet tables and timezones; explicitTz = $explicitTz; " +
    +      s"sessionTzOpt = $sessionTzOpt") {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +      withTable(baseTable, s"like_$baseTable", s"select_$baseTable") {
    +        val localTz = TimeZone.getDefault()
    +        val localTzId = localTz.getID()
    +        // If we ever add a property to set the table timezone by default, defaultTz would change
    +        val defaultTz = None
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    +        }.getOrElse("")
    +        spark.sql(
    +          raw"""CREATE TABLE $baseTable (
    +                |  x int
    +                | )
    +                | STORED AS PARQUET
    +                | $tblProperties
    +            """.stripMargin)
    +        val expectedTableTz = explicitTz.orElse(defaultTz)
    +        checkHasTz(baseTable, expectedTableTz)
    +        spark.sql(s"CREATE TABLE like_$baseTable LIKE $baseTable")
    +        checkHasTz(s"like_$baseTable", expectedTableTz)
    +        spark.sql(
    +          raw"""CREATE TABLE select_$baseTable
    +                | STORED AS PARQUET
    +                | AS
    +                | SELECT * from $baseTable
    +            """.stripMargin)
    +        checkHasTz(s"select_$baseTable", defaultTz)
    +
    +        // check alter table, setting, unsetting, resetting the property
    +        spark.sql(
    +          raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="America/Los_Angeles")""")
    +        checkHasTz(baseTable, Some("America/Los_Angeles"))
    +        spark.sql( raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="UTC")""")
    +        checkHasTz(baseTable, Some("UTC"))
    +        spark.sql( raw"""ALTER TABLE $baseTable UNSET TBLPROPERTIES ($key)""")
    +        checkHasTz(baseTable, None)
    +        explicitTz.foreach { tz =>
    +          spark.sql( raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="$tz")""")
    +          checkHasTz(baseTable, expectedTableTz)
    +        }
    +      }
    +    }
    +  }
    +
    +  val desiredTimestampStrings = Seq(
    +    "2015-12-31 23:50:59.123",
    +    "2015-12-31 22:49:59.123",
    +    "2016-01-01 00:39:59.123",
    +    "2016-01-01 01:29:59.123"
    +  )
    +  // We don't want to mess with timezones inside the tests themselves, since we use a shared
    +  // spark context, and then we might be prone to issues from lazy vals for timezones.  Instead,
    +  // we manually adjust the timezone just to determine what the desired millis (since epoch, in utc)
    +  // is for various "wall-clock" times in different timezones, and then we can compare against those
    +  // in our tests.
    +  val originalTz = TimeZone.getDefault
    +  val timestampTimezoneToMillis = try {
    +    (for {
    +      timestampString <- desiredTimestampStrings
    +      timezone <- Seq("America/Los_Angeles", "Europe/Berlin", "UTC").map {
    +        TimeZone.getTimeZone(_)
    +      }
    +    } yield {
    +      TimeZone.setDefault(timezone)
    +      val timestamp = Timestamp.valueOf(timestampString)
    +      (timestampString, timezone.getID()) -> timestamp.getTime()
    +    }).toMap
    +  } finally {
    +    TimeZone.setDefault(originalTz)
    +  }
    +
    +  private def createRawData(spark: SparkSession): Dataset[(String, Timestamp)] = {
    +    val originalTsStrings = Seq(
    +      "2015-12-31 22:49:59.123",
    +      "2015-12-31 23:50:59.123",
    +      "2016-01-01 00:39:59.123",
    +      "2016-01-01 01:29:59.123"
    +    )
    +    val rowRdd = spark.sparkContext.parallelize(originalTsStrings, 1).map(Row(_))
    +    val schema = StructType(Seq(
    +      StructField("display", StringType, true)
    +    ))
    +    val df = spark.createDataFrame(rowRdd, schema)
    +    // this will get the millis corresponding to the display time given the current *session*
    +    // timezone.
    +    import spark.implicits._
    +    df.withColumn("ts", expr("cast(display as timestamp)")).map { row =>
    +      (row.getAs[String](0), row.getAs[Timestamp](1))
    +    }
    --- End diff --
    
    nit: `df.withColumn(...).as[(String, Timestamp)]`
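    
    i.e., a sketch of the suggested simplification (same columns and `spark` session as in the diff):
    
    ```scala
    import java.sql.Timestamp
    import org.apache.spark.sql.functions.expr
    import spark.implicits._
    
    // No row-by-row map needed; Spark derives the tuple encoder from the type.
    df.withColumn("ts", expr("cast(display as timestamp)")).as[(String, Timestamp)]
    ```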




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112502905
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -42,6 +52,15 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
            """.stripMargin)
       }
     
    +  override def afterEach(): Unit = {
    --- End diff --
    
    I was probably just being a little paranoid, perhaps I had missed a `withTable` somewhere.  In the current code, things work just fine if I remove them.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    @squito I'm sorry for the late reply.
    I reviewed this pr again and dug into the current Hive behavior.
    
    It seems both behaviors are the same when the table property exists, but this is a little complicated, so I want to confirm whether the following behavior is correct with a simple example:
    
    ```scala
    
    scala> spark.conf.set("spark.sql.session.timeZone", "JST")
    
    scala> sql("create table foo (display string, ts timestamp) stored as parquet tblproperties (parquet.mr.int96.write.zone='JST')")
    scala> sql("insert into foo values ('2017-01-01 00:00:00', cast('2017-01-01 00:00:00' as timestamp))")
    
    scala> sql("select * from foo").show(truncate = false)
    +-------------------+-------------------+
    |display            |ts                 |
    +-------------------+-------------------+
    |2017-01-01 00:00:00|2017-01-01 00:00:00|
    +-------------------+-------------------+
    
    
    scala> sql("create table bar (display string, ts timestamp) stored as parquet")
    scala> sql("insert into bar values ('2017-01-01 00:00:00', cast('2017-01-01 00:00:00' as timestamp))")
    
    scala> sql("select * from bar").show(truncate = false)
    +-------------------+-------------------+
    |display            |ts                 |
    +-------------------+-------------------+
    |2017-01-01 00:00:00|2017-01-01 00:00:00|
    +-------------------+-------------------+
    
    
    scala> spark.conf.set("spark.sql.session.timeZone", "PST")
    
    
    // ts is adjusted from JST to PST.
    scala> sql("select * from foo").show(truncate = false)
    +-------------------+-------------------+
    |display            |ts                 |
    +-------------------+-------------------+
    |2017-01-01 00:00:00|2017-01-01 00:00:00|
    +-------------------+-------------------+
    
    
    // ts isn't adjusted.
    scala> sql("select * from bar").show(truncate = false)
    +-------------------+-------------------+
    |display            |ts                 |
    +-------------------+-------------------+
    |2017-01-01 00:00:00|2016-12-31 07:00:00|
    +-------------------+-------------------+
    ```





[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74609/
    Test FAILed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    great!  thanks @ueshin 




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #72288 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72288/testReport)** for PR 16781 at commit [`5b49ae0`](https://github.com/apache/spark/commit/5b49ae026044b46f0899a9e792e2b71733c4cb8a).




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112368447
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +160,326 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    Seq(
    +      "UTC" -> "UTC",
    +      "LA" -> "America/Los_Angeles",
    +      "Berlin" -> "Europe/Berlin"
    +    ).foreach { case (tableName, zone) =>
    --- End diff --
    
    Should be `testTimezones.foreach { ...`?




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112373593
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +160,326 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    Seq(
    +      "UTC" -> "UTC",
    +      "LA" -> "America/Los_Angeles",
    +      "Berlin" -> "Europe/Berlin"
    +    ).foreach { case (tableName, zone) =>
    +      if (zone != localTz) {
    +        testCreateWriteRead(sparkSession, tableName, Some(zone), sessionTzOpt)
    +      }
    +    }
    +  }
    +
    +  private def testCreateWriteRead(
    +      sparkSession: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    testCreateAlterTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testWriteTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testReadTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +  }
    +
    +  private def checkHasTz(table: String, tz: Option[String]): Unit = {
    +    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    +    assert(tableMetadata.properties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) === tz)
    +  }
    +
    +  private def testCreateAlterTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    test(s"SPARK-12297: Create and Alter Parquet tables and timezones; explicitTz = $explicitTz; " +
    +      s"sessionTzOpt = $sessionTzOpt") {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +      withTable(baseTable, s"like_$baseTable", s"select_$baseTable") {
    +        val localTz = TimeZone.getDefault()
    +        val localTzId = localTz.getID()
    --- End diff --
    
    `localTz` and `localTzId` aren't used.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Btw, I noticed that if we decided to use `SQLConf.PARQUET_TABLE_INCLUDE_TIMEZONE`, the default `PARQUET_TIMEZONE_TABLE_PROPERTY` value should be `GMT` instead of the session local timezone, as Hive does.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112373013
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +160,326 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    Seq(
    +      "UTC" -> "UTC",
    +      "LA" -> "America/Los_Angeles",
    +      "Berlin" -> "Europe/Berlin"
    +    ).foreach { case (tableName, zone) =>
    +      if (zone != localTz) {
    +        testCreateWriteRead(sparkSession, tableName, Some(zone), sessionTzOpt)
    +      }
    +    }
    +  }
    +
    +  private def testCreateWriteRead(
    +      sparkSession: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    testCreateAlterTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testWriteTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testReadTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +  }
    +
    +  private def checkHasTz(table: String, tz: Option[String]): Unit = {
    +    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    --- End diff --
    
    Should explicitly pass `sparkSession` and use it here?




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75101/
    Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74550 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74550/testReport)** for PR 16781 at commit [`db7e514`](https://github.com/apache/spark/commit/db7e5147c09d9f773e630982d02709ca06b3b638).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Build finished. Test PASSed.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r113346208
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -397,13 +392,38 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
                 schema = new StructType().add("display", StringType).add("ts", TimestampType),
                 options = options
               )
    -          Seq(false, true).foreach { vectorized =>
    -            withClue(s"vectorized = $vectorized;") {
    +
    +          // also write out a partitioned table, to make sure we can access that correctly.
    +          // add a column we can partition by (value doesn't particularly matter).
    +          val partitionedData = adjustedRawData.withColumn("id", monotonicallyIncreasingId)
    +          partitionedData.write.partitionBy("id")
    +            .parquet(partitionedPath.getCanonicalPath)
    +          // unfortunately, catalog.createTable() doesn't let us specify partitioning, so just use
    +          // a "CREATE TABLE" stmt.
    +          val tblOpts = explicitTz.map { tz => raw"""TBLPROPERTIES ($key="$tz")""" }.getOrElse("")
    +          spark.sql(raw"""CREATE EXTERNAL TABLE partitioned_$baseTable (
    +                         |  display string,
    +                         |  ts timestamp
    +                         |)
    +                         |PARTITIONED BY (id bigint)
    --- End diff --
    
    Should we also test a partitioned table like `PARTITIONED BY (ts timestamp)`?
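    
    A sketch of what that could look like, reusing the names from the diff above:
    
    ```scala
    // Hypothetical variant: partition by the timestamp column itself, so the
    // partition-handling path also exercises the timezone adjustment.
    spark.sql(s"""CREATE EXTERNAL TABLE partitioned_ts_$baseTable (
                 |  display string
                 |)
                 |PARTITIONED BY (ts timestamp)
                 |STORED AS PARQUET
                 |LOCATION '${partitionedPath.getCanonicalPath}'
               """.stripMargin)
    ```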




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    I checked HIVE-12767 and gave this pr a rough review.
    I've collected my thoughts about the desired behavior for this issue below; please correct me if I'm wrong.
    
    when creating table:
    
    - if `SQLConf.PARQUET_TABLE_INCLUDE_TIMEZONE` == true
      - if a property `PARQUET_TIMEZONE_TABLE_PROPERTY` exists
        - include the table property `PARQUET_TIMEZONE_TABLE_PROPERTY`
        - use the `PARQUET_TIMEZONE_TABLE_PROPERTY` value
      - else
        - include the table property `PARQUET_TIMEZONE_TABLE_PROPERTY`
        - use session local timezone as the default `PARQUET_TIMEZONE_TABLE_PROPERTY` value
    - else
      - if a property `PARQUET_TIMEZONE_TABLE_PROPERTY` exists
        - include the table property `PARQUET_TIMEZONE_TABLE_PROPERTY`
        - use the `PARQUET_TIMEZONE_TABLE_PROPERTY` value
      - else
        - don't include table property `PARQUET_TIMEZONE_TABLE_PROPERTY`
    
    when writing/reading data:
    
    - if a table property `PARQUET_TIMEZONE_TABLE_PROPERTY` exists
      - use the `PARQUET_TIMEZONE_TABLE_PROPERTY` value to adjust timezone
    - else
      - don't adjust timezone
    
    Timezone-related expressions now respect the session local timezone, so we should also use the session local timezone as the default value of `PARQUET_TIMEZONE_TABLE_PROPERTY` instead of the system timezone, i.e. use `sparkSession.sessionState.conf.sessionLocalTimeZone` instead of `TimeZone.getDefault()`.
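    
    In code, the suggested default amounts to something like:
    
    ```scala
    // Sketch: derive the default from the session conf rather than the JVM default.
    val defaultTz = sparkSession.sessionState.conf.sessionLocalTimeZone
    // instead of:
    // val defaultTz = java.util.TimeZone.getDefault().getID()
    ```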





[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r112373782
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +160,326 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    Seq(
    +      "UTC" -> "UTC",
    +      "LA" -> "America/Los_Angeles",
    +      "Berlin" -> "Europe/Berlin"
    +    ).foreach { case (tableName, zone) =>
    +      if (zone != localTz) {
    +        testCreateWriteRead(sparkSession, tableName, Some(zone), sessionTzOpt)
    +      }
    +    }
    +  }
    +
    +  private def testCreateWriteRead(
    +      sparkSession: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    testCreateAlterTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testWriteTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testReadTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +  }
    +
    +  private def checkHasTz(table: String, tz: Option[String]): Unit = {
    +    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    +    assert(tableMetadata.properties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) === tz)
    +  }
    +
    +  private def testCreateAlterTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    test(s"SPARK-12297: Create and Alter Parquet tables and timezones; explicitTz = $explicitTz; " +
    +      s"sessionTzOpt = $sessionTzOpt") {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +      withTable(baseTable, s"like_$baseTable", s"select_$baseTable") {
    +        val localTz = TimeZone.getDefault()
    +        val localTzId = localTz.getID()
    +        // If we ever add a property to set the table timezone by default, defaultTz would change
    +        val defaultTz = None
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    +        }.getOrElse("")
    +        spark.sql(
    +          raw"""CREATE TABLE $baseTable (
    +                |  x int
    +                | )
    +                | STORED AS PARQUET
    +                | $tblProperties
    +            """.stripMargin)
    +        val expectedTableTz = explicitTz.orElse(defaultTz)
    +        checkHasTz(baseTable, expectedTableTz)
    +        spark.sql(s"CREATE TABLE like_$baseTable LIKE $baseTable")
    +        checkHasTz(s"like_$baseTable", expectedTableTz)
    +        spark.sql(
    +          raw"""CREATE TABLE select_$baseTable
    +                | STORED AS PARQUET
    +                | AS
    +                | SELECT * from $baseTable
    +            """.stripMargin)
    +        checkHasTz(s"select_$baseTable", defaultTz)
    +
    +        // check alter table, setting, unsetting, resetting the property
    +        spark.sql(
    +          raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="America/Los_Angeles")""")
    +        checkHasTz(baseTable, Some("America/Los_Angeles"))
    +        spark.sql( raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="UTC")""")
    --- End diff --
    
    nit: remove the extra whitespace; there are two more occurrences below this.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74042 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74042/testReport)** for PR 16781 at commit [`f0b89fd`](https://github.com/apache/spark/commit/f0b89fdca05d61bff7f79047b5ac72125bfd107e).




[GitHub] spark pull request #16781: [SPARK-12297][SQL][POC] Hive compatibility for Pa...

Posted by zivanfi <gi...@git.apache.org>.
Github user zivanfi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r104673553
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -137,8 +141,190 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
       }
     
       test("SPARK-16344: array of struct with a single field named 'array_element'") {
    +
         testParquetHiveCompatibility(
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  test(s"SPARK-12297: Parquet Timestamp & Hive timezone") {
    --- End diff --
    
    I think it would be better to have separate test cases for adjustments when reading, adjustments when writing, and setting the table property when creating tables.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r113345272
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -17,14 +17,25 @@
     
     package org.apache.spark.sql.hive
     
    +import java.io.File
     import java.sql.Timestamp
    +import java.util.TimeZone
     
    -import org.apache.spark.sql.Row
    -import org.apache.spark.sql.execution.datasources.parquet.ParquetCompatibilityTest
    +import org.apache.hadoop.fs.{FileSystem, Path}
    +import org.apache.parquet.hadoop.ParquetFileReader
    +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
    +import org.scalatest.BeforeAndAfterEach
    +
    +import org.apache.spark.sql.{AnalysisException, Dataset, Row, SparkSession}
    +import org.apache.spark.sql.catalyst.TableIdentifier
    +import org.apache.spark.sql.execution.datasources.parquet.{ParquetCompatibilityTest, ParquetFileFormat}
    +import org.apache.spark.sql.functions._
     import org.apache.spark.sql.hive.test.TestHiveSingleton
     import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
     
    -class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHiveSingleton {
    +class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHiveSingleton
    +    with BeforeAndAfterEach {
    --- End diff --
    
    We don't need `BeforeAndAfterEach` anymore.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r111891049
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java ---
    @@ -90,11 +93,30 @@
     
       private final PageReader pageReader;
       private final ColumnDescriptor descriptor;
    +  private final TimeZone storageTz;
    +  private final TimeZone sessionTz;
     
    -  public VectorizedColumnReader(ColumnDescriptor descriptor, PageReader pageReader)
    +  public VectorizedColumnReader(ColumnDescriptor descriptor, PageReader pageReader,
    +                                Configuration conf)
           throws IOException {
         this.descriptor = descriptor;
         this.pageReader = pageReader;
    +    // If the table has a timezone property, apply the correct conversions.  See SPARK-12297.
    +    // The conf is sometimes null in tests.
    +    String sessionTzString =
    +        conf == null ? null : conf.get(SQLConf.SESSION_LOCAL_TIMEZONE().key());
    +    if (sessionTzString == null || sessionTzString.isEmpty()) {
    +      sessionTz = TimeZone.getDefault();
    --- End diff --
    
    Shall we use `DateTimeUtils.defaultTimeZone()`?
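
    For reference, a small Scala sketch of the fallback under discussion (the PR code above is Java; `conf` is the Hadoop `Configuration` handed to the reader):

        import java.util.TimeZone
        import org.apache.hadoop.conf.Configuration
        import org.apache.spark.sql.internal.SQLConf

        // Prefer the session-local timezone from the conf; fall back to the JVM
        // default when it is absent or empty.  The suggestion above is to use
        // DateTimeUtils.defaultTimeZone() for that fallback instead.
        def sessionTimeZone(conf: Configuration): TimeZone =
          Option(conf).flatMap(c => Option(c.get(SQLConf.SESSION_LOCAL_TIMEZONE.key)))
            .filter(_.nonEmpty)
            .map(TimeZone.getTimeZone)
            .getOrElse(TimeZone.getDefault)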




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #75966 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75966/testReport)** for PR 16781 at commit [`44a8bbb`](https://github.com/apache/spark/commit/44a8bbb17484a61dd984fcb451e3b1be8c539e9f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r111891207
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala ---
    @@ -673,12 +681,26 @@ private[parquet] object ParquetRowConverter {
         unscaled
       }
     
    -  def binaryToSQLTimestamp(binary: Binary): SQLTimestamp = {
    +  /**
    +   * Converts an int96 to a SQLTimestamp, given both the storage timezone and the local timezone.
    +   * The timestamp is really meant to be interpreted as a "floating time", but since we
    +   * actually store it as micros since epoch, why we have to apply a conversion when timezones
    +   * change.
    +   * @param binary
    +   * @return
    +   */
    --- End diff --
    
    Can you add `@param fromTz` and `@param toTz` with descriptions of each?
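
    For context, the adjustment described in the doc comment amounts to shifting the stored micros so that the wall-clock reading stays the same across the two timezones.  A simplified sketch, ignoring DST edge cases around the offset lookup:

        import java.util.TimeZone

        // Pick newMicros such that newMicros + toTz.offset == micros + fromTz.offset,
        // i.e. the value displays in toTz with the same wall-clock time the writer
        // saw in fromTz.
        def convertTz(micros: Long, fromTz: TimeZone, toTz: TimeZone): Long = {
          val millis = micros / 1000L
          micros + (fromTz.getOffset(millis) - toTz.getOffset(millis)) * 1000L
        }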




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    @ueshin thanks for taking a look.  Yes, that understanding is correct.  Another way to think about it is to compare the same operations across different file formats, e.g. textfile.  Those work more like parquet does *after* this patch.  I had that explanation in a comment on the JIRA -- I just updated the JIRA description to include it.

    I'll address your comments; they're also making me take a closer look at a couple of things.  I should push an update tomorrow.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r114478594
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +152,373 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    testTimezones.foreach { case (tableName, zone) =>
    +      if (zone != localTz) {
    +        testCreateWriteRead(sparkSession, tableName, Some(zone), sessionTzOpt)
    +      }
    +    }
    +  }
    +
    +  private def testCreateWriteRead(
    +      sparkSession: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    testCreateAlterTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testWriteTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testReadTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +  }
    +
    +  private def checkHasTz(spark: SparkSession, table: String, tz: Option[String]): Unit = {
    +    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    +    assert(tableMetadata.properties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) === tz)
    +  }
    +
    +  private def testCreateAlterTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    test(s"SPARK-12297: Create and Alter Parquet tables and timezones; explicitTz = $explicitTz; " +
    +      s"sessionTzOpt = $sessionTzOpt") {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +      withTable(baseTable, s"like_$baseTable", s"select_$baseTable", s"partitioned_$baseTable") {
    +        // If we ever add a property to set the table timezone by default, defaultTz would change
    +        val defaultTz = None
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    +        }.getOrElse("")
    +        spark.sql(
    +          raw"""CREATE TABLE $baseTable (
    +                |  x int
    +                | )
    +                | STORED AS PARQUET
    +                | $tblProperties
    +            """.stripMargin)
    +        val expectedTableTz = explicitTz.orElse(defaultTz)
    +        checkHasTz(spark, baseTable, expectedTableTz)
    +        spark.sql(
    +          raw"""CREATE TABLE partitioned_$baseTable (
    +                |  x int
    +                | )
    +                | PARTITIONED BY (y int)
    +                | STORED AS PARQUET
    +                | $tblProperties
    +            """.stripMargin)
    +        checkHasTz(spark, s"partitioned_$baseTable", expectedTableTz)
    +        spark.sql(s"CREATE TABLE like_$baseTable LIKE $baseTable")
    +        checkHasTz(spark, s"like_$baseTable", expectedTableTz)
    +        spark.sql(
    +          raw"""CREATE TABLE select_$baseTable
    +                | STORED AS PARQUET
    +                | AS
    +                | SELECT * from $baseTable
    +            """.stripMargin)
    +        checkHasTz(spark, s"select_$baseTable", defaultTz)
    +
    +        // check alter table, setting, unsetting, resetting the property
    +        spark.sql(
    +          raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="America/Los_Angeles")""")
    +        checkHasTz(spark, baseTable, Some("America/Los_Angeles"))
    +        spark.sql(raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="UTC")""")
    +        checkHasTz(spark, baseTable, Some("UTC"))
    +        spark.sql(raw"""ALTER TABLE $baseTable UNSET TBLPROPERTIES ($key)""")
    +        checkHasTz(spark, baseTable, None)
    +        explicitTz.foreach { tz =>
    +          spark.sql(raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="$tz")""")
    +          checkHasTz(spark, baseTable, expectedTableTz)
    +        }
    +      }
    +    }
    +  }
    +
    +  val desiredTimestampStrings = Seq(
    +    "2015-12-31 22:49:59.123",
    +    "2015-12-31 23:50:59.123",
    +    "2016-01-01 00:39:59.123",
    +    "2016-01-01 01:29:59.123"
    +  )
    +  // We don't want to mess with timezones inside the tests themselves, since we use a shared
    +  // spark context, and then we might be prone to issues from lazy vals for timezones.  Instead,
    +  // we manually adjust the timezone just to determine what the desired millis (since epoch, in utc)
    +  // is for various "wall-clock" times in different timezones, and then we can compare against those
    +  // in our tests.
    +  val timestampTimezoneToMillis = {
    +    val originalTz = TimeZone.getDefault
    +    try {
    +      (for {
    +        timestampString <- desiredTimestampStrings
    +        timezone <- Seq("America/Los_Angeles", "Europe/Berlin", "UTC").map {
    +          TimeZone.getTimeZone(_)
    +        }
    +      } yield {
    +        TimeZone.setDefault(timezone)
    +        val timestamp = Timestamp.valueOf(timestampString)
    +        (timestampString, timezone.getID()) -> timestamp.getTime()
    +      }).toMap
    +    } finally {
    +      TimeZone.setDefault(originalTz)
    +    }
    +  }
    +
    +  private def createRawData(spark: SparkSession): Dataset[(String, Timestamp)] = {
    +    import spark.implicits._
    +    val df = desiredTimestampStrings.toDF("display")
    +    // this will get the millis corresponding to the display time given the current *session*
    +    // timezone.
    +    df.withColumn("ts", expr("cast(display as timestamp)")).as[(String, Timestamp)]
    +  }
    +
    +  private def testWriteTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]) : Unit = {
    +    val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +    test(s"SPARK-12297: Write to Parquet tables with Timestamps; explicitTz = $explicitTz; " +
    +        s"sessionTzOpt = $sessionTzOpt") {
    +
    +      withTable(s"saveAsTable_$baseTable", s"insert_$baseTable", s"partitioned_ts_$baseTable") {
    +        val sessionTzId = sessionTzOpt.getOrElse(TimeZone.getDefault().getID())
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    +        }.getOrElse("")
    +
    +
    --- End diff --
    
    nit: remove extra line.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74609 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74609/testReport)** for PR 16781 at commit [`f4dca27`](https://github.com/apache/spark/commit/f4dca27e301c1ce94e7175c6a852a4e94cea0555).




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74459 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74459/testReport)** for PR 16781 at commit [`c242fb8`](https://github.com/apache/spark/commit/c242fb89e2861c81c44877a3acfded9073b9104b).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74611/
    Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    I looked over this PR and noticed that you're using `TimeZone.getDefault()` as the "local" timezone.
    Since the timestamp values are represented internally as "the number of micros since epoch", we should use `GMT` as the "local" timezone.
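
    To illustrate: the internal micros-since-epoch value is timezone-free, but `Timestamp.valueOf` interprets a wall-clock string in the JVM default timezone, so the same string yields different instants under different defaults:

        import java.sql.Timestamp
        import java.util.TimeZone

        TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
        val utcMillis = Timestamp.valueOf("2016-01-01 00:00:00").getTime
        TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
        val laMillis = Timestamp.valueOf("2016-01-01 00:00:00").getTime
        assert(laMillis - utcMillis == 8L * 60 * 60 * 1000)  // PST is UTC-8 in January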




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #76556 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76556/testReport)** for PR 16781 at commit [`2537437`](https://github.com/apache/spark/commit/2537437a0c1f022acc215518e4728d24f6f6cf97).




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #72288 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72288/testReport)** for PR 16781 at commit [`5b49ae0`](https://github.com/apache/spark/commit/5b49ae026044b46f0899a9e792e2b71733c4cb8a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    @ueshin thanks for taking a look earlier; sorry it has taken me some time to update this.
    
    Things to note since last time:
    
    1) Hive has since been updated in [HIVE-16231](https://issues.apache.org/jira/browse/HIVE-16231) to use the local timezone, not GMT, as the default for storing data.  Really, this is the change that should have been made in HIVE-12767 -- otherwise you lose backwards compatibility with old datasets.
    
    2) This PR now uses the session timezone, rather than the local timezone.  There are tests to confirm that each combination of session timezone and storage timezone works correctly.
    
    3) Predicate pushdown is handled.  I actually didn't need to change the behavior at all, since predicates are never pushed to int96 -- but there are some tests that confirm this.
    
    I'm sure there is some minor cleanup that could be done, but overall I think this is ready now.  I'd appreciate it if you'd take another look and offer any suggestions you can.
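
    For anyone following along, the legacy-table use case then looks roughly like this (mirroring the tests in this PR; the table name is hypothetical):

        // Mark a table whose parquet files were written by legacy Hive, so that
        // Spark applies the Hive-compatible conversion when reading its timestamps.
        val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
        spark.sql(s"""ALTER TABLE legacy_hive_table SET TBLPROPERTIES ($key="America/Los_Angeles")""")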




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r114478431
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -29,6 +29,8 @@ import org.apache.spark.sql.catalyst.{QualifiedTableName, TableIdentifier}
     import org.apache.spark.sql.catalyst.catalog._
     import org.apache.spark.sql.catalyst.plans.logical._
     import org.apache.spark.sql.execution.datasources._
    +import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
    +import org.apache.spark.sql.internal.SQLConf
    --- End diff --
    
    nit: unnecessary import.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Ah, I see, it looks good to me for now.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16781




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #16781: [SPARK-12297][SQL] Hive compatibility for Parquet...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16781#discussion_r114478818
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala ---
    @@ -141,4 +152,373 @@ class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHi
           Row(Seq(Row(1))),
           "ARRAY<STRUCT<array_element: INT>>")
       }
    +
    +  val testTimezones = Seq(
    +    "UTC" -> "UTC",
    +    "LA" -> "America/Los_Angeles",
    +    "Berlin" -> "Europe/Berlin"
    +  )
    +  // Check creating parquet tables with timestamps, writing data into them, and reading it back out
    +  // under a variety of conditions:
    +  // * tables with explicit tz and those without
    +  // * altering table properties directly
    +  // * variety of timezones, local & non-local
    +  val sessionTimezones = testTimezones.map(_._2).map(Some(_)) ++ Seq(None)
    +  sessionTimezones.foreach { sessionTzOpt =>
    +    val sparkSession = spark.newSession()
    +    sessionTzOpt.foreach { tz => sparkSession.conf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, tz) }
    +    testCreateWriteRead(sparkSession, "no_tz", None, sessionTzOpt)
    +    val localTz = TimeZone.getDefault.getID()
    +    testCreateWriteRead(sparkSession, "local", Some(localTz), sessionTzOpt)
    +    // check with a variety of timezones.  The unit tests currently are configured to always use
    +    // America/Los_Angeles, but even if they didn't, we'd be sure to cover a non-local timezone.
    +    testTimezones.foreach { case (tableName, zone) =>
    +      if (zone != localTz) {
    +        testCreateWriteRead(sparkSession, tableName, Some(zone), sessionTzOpt)
    +      }
    +    }
    +  }
    +
    +  private def testCreateWriteRead(
    +      sparkSession: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    testCreateAlterTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testWriteTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +    testReadTablesWithTimezone(sparkSession, baseTable, explicitTz, sessionTzOpt)
    +  }
    +
    +  private def checkHasTz(spark: SparkSession, table: String, tz: Option[String]): Unit = {
    +    val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table))
    +    assert(tableMetadata.properties.get(ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY) === tz)
    +  }
    +
    +  private def testCreateAlterTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +    test(s"SPARK-12297: Create and Alter Parquet tables and timezones; explicitTz = $explicitTz; " +
    +      s"sessionTzOpt = $sessionTzOpt") {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +      withTable(baseTable, s"like_$baseTable", s"select_$baseTable", s"partitioned_$baseTable") {
    +        // If we ever add a property to set the table timezone by default, defaultTz would change
    +        val defaultTz = None
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    +        }.getOrElse("")
    +        spark.sql(
    +          raw"""CREATE TABLE $baseTable (
    +                |  x int
    +                | )
    +                | STORED AS PARQUET
    +                | $tblProperties
    +            """.stripMargin)
    +        val expectedTableTz = explicitTz.orElse(defaultTz)
    +        checkHasTz(spark, baseTable, expectedTableTz)
    +        spark.sql(
    +          raw"""CREATE TABLE partitioned_$baseTable (
    +                |  x int
    +                | )
    +                | PARTITIONED BY (y int)
    +                | STORED AS PARQUET
    +                | $tblProperties
    +            """.stripMargin)
    +        checkHasTz(spark, s"partitioned_$baseTable", expectedTableTz)
    +        spark.sql(s"CREATE TABLE like_$baseTable LIKE $baseTable")
    +        checkHasTz(spark, s"like_$baseTable", expectedTableTz)
    +        spark.sql(
    +          raw"""CREATE TABLE select_$baseTable
    +                | STORED AS PARQUET
    +                | AS
    +                | SELECT * from $baseTable
    +            """.stripMargin)
    +        checkHasTz(spark, s"select_$baseTable", defaultTz)
    +
    +        // check alter table, setting, unsetting, resetting the property
    +        spark.sql(
    +          raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="America/Los_Angeles")""")
    +        checkHasTz(spark, baseTable, Some("America/Los_Angeles"))
    +        spark.sql(raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="UTC")""")
    +        checkHasTz(spark, baseTable, Some("UTC"))
    +        spark.sql(raw"""ALTER TABLE $baseTable UNSET TBLPROPERTIES ($key)""")
    +        checkHasTz(spark, baseTable, None)
    +        explicitTz.foreach { tz =>
    +          spark.sql(raw"""ALTER TABLE $baseTable SET TBLPROPERTIES ($key="$tz")""")
    +          checkHasTz(spark, baseTable, expectedTableTz)
    +        }
    +      }
    +    }
    +  }
    +
    +  val desiredTimestampStrings = Seq(
    +    "2015-12-31 22:49:59.123",
    +    "2015-12-31 23:50:59.123",
    +    "2016-01-01 00:39:59.123",
    +    "2016-01-01 01:29:59.123"
    +  )
    +  // We don't want to mess with timezones inside the tests themselves, since we use a shared
    +  // spark context, and then we might be prone to issues from lazy vals for timezones.  Instead,
    +  // we manually adjust the timezone just to determine what the desired millis (since epoch, in utc)
    +  // is for various "wall-clock" times in different timezones, and then we can compare against those
    +  // in our tests.
    +  val timestampTimezoneToMillis = {
    +    val originalTz = TimeZone.getDefault
    +    try {
    +      (for {
    +        timestampString <- desiredTimestampStrings
    +        timezone <- Seq("America/Los_Angeles", "Europe/Berlin", "UTC").map {
    +          TimeZone.getTimeZone(_)
    +        }
    +      } yield {
    +        TimeZone.setDefault(timezone)
    +        val timestamp = Timestamp.valueOf(timestampString)
    +        (timestampString, timezone.getID()) -> timestamp.getTime()
    +      }).toMap
    +    } finally {
    +      TimeZone.setDefault(originalTz)
    +    }
    +  }
    +
    +  private def createRawData(spark: SparkSession): Dataset[(String, Timestamp)] = {
    +    import spark.implicits._
    +    val df = desiredTimestampStrings.toDF("display")
    +    // this will get the millis corresponding to the display time given the current *session*
    +    // timezone.
    +    df.withColumn("ts", expr("cast(display as timestamp)")).as[(String, Timestamp)]
    +  }
    +
    +  private def testWriteTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]) : Unit = {
    +    val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    +    test(s"SPARK-12297: Write to Parquet tables with Timestamps; explicitTz = $explicitTz; " +
    +        s"sessionTzOpt = $sessionTzOpt") {
    +
    +      withTable(s"saveAsTable_$baseTable", s"insert_$baseTable", s"partitioned_ts_$baseTable") {
    +        val sessionTzId = sessionTzOpt.getOrElse(TimeZone.getDefault().getID())
    +        // check that created tables have correct TBLPROPERTIES
    +        val tblProperties = explicitTz.map {
    +          tz => raw"""TBLPROPERTIES ($key="$tz")"""
    +        }.getOrElse("")
    +
    +
    +        val rawData = createRawData(spark)
    +        // Check writing data out.
    +        // We write data into our tables, and then check the raw parquet files to see whether
    +        // the correct conversion was applied.
    +        rawData.write.saveAsTable(s"saveAsTable_$baseTable")
    +        checkHasTz(spark, s"saveAsTable_$baseTable", None)
    +        spark.sql(
    +          raw"""CREATE TABLE insert_$baseTable (
    +                |  display string,
    +                |  ts timestamp
    +                | )
    +                | STORED AS PARQUET
    +                | $tblProperties
    +               """.stripMargin)
    +        checkHasTz(spark, s"insert_$baseTable", explicitTz)
    +        rawData.write.insertInto(s"insert_$baseTable")
    +        // no matter what, roundtripping via the table should leave the data unchanged
    +        val readFromTable = spark.table(s"insert_$baseTable").collect()
    +          .map { row => (row.getAs[String](0), row.getAs[Timestamp](1)).toString() }.sorted
    +        assert(readFromTable === rawData.collect().map(_.toString()).sorted)
    +
    +        // Now we load the raw parquet data on disk, and check if it was adjusted correctly.
    +        // Note that we only store the timezone in the table property, so when we read the
    +        // data this way, we're bypassing all of the conversion logic, and reading the raw
    +        // values in the parquet file.
    +        val onDiskLocation = spark.sessionState.catalog
    +          .getTableMetadata(TableIdentifier(s"insert_$baseTable")).location.getPath
    +        // we test reading the data back with and without the vectorized reader, to make sure we
    +        // haven't broken reading parquet from non-hive tables, with both readers.
    +        Seq(false, true).foreach { vectorized =>
    +          spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, vectorized)
    +          val readFromDisk = spark.read.parquet(onDiskLocation).collect()
    +          val storageTzId = explicitTz.getOrElse(sessionTzId)
    +          readFromDisk.foreach { row =>
    +            val displayTime = row.getAs[String](0)
    +            val millis = row.getAs[Timestamp](1).getTime()
    +            val expectedMillis = timestampTimezoneToMillis((displayTime, storageTzId))
    +            assert(expectedMillis === millis, s"Display time '$displayTime' was stored " +
    +              s"incorrectly with sessionTz = ${sessionTzOpt}; Got $millis, expected " +
    +              s"$expectedMillis (delta = ${millis - expectedMillis})")
    +          }
    +        }
    +
    +        // check tables partitioned by timestamps.  We don't compare the "raw" data in this case,
    +        // since they are adjusted even when we bypass the hive table.
    +        rawData.write.partitionBy("ts").saveAsTable(s"partitioned_ts_$baseTable")
    +        val partitionDiskLocation = spark.sessionState.catalog
    +          .getTableMetadata(TableIdentifier(s"partitioned_ts_$baseTable")).location.getPath
    +        // no matter what mix of timezones we use, the dirs should specify the value with the
    +        // same time we use for display.
    +        val parts = new File(partitionDiskLocation).list().collect {
    +          case name if name.startsWith("ts=") => URLDecoder.decode(name.stripPrefix("ts="))
    +        }.toSet
    +        assert(parts === desiredTimestampStrings.toSet)
    +      }
    +    }
    +  }
    +
    +  private def testReadTablesWithTimezone(
    +      spark: SparkSession,
    +      baseTable: String,
    +      explicitTz: Option[String],
    +      sessionTzOpt: Option[String]): Unit = {
    +      val key = ParquetFileFormat.PARQUET_TIMEZONE_TABLE_PROPERTY
    --- End diff --
    
    nit: indent




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #76141 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76141/testReport)** for PR 16781 at commit [`e31657a`](https://github.com/apache/spark/commit/e31657a1b65ab73d52651c4e0b018d457e44d47f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #74042 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74042/testReport)** for PR 16781 at commit [`f0b89fd`](https://github.com/apache/spark/commit/f0b89fdca05d61bff7f79047b5ac72125bfd107e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #76391 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76391/testReport)** for PR 16781 at commit [`fc17a2e`](https://github.com/apache/spark/commit/fc17a2edc8e46dd95019f8ece972610d40f3ed2c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ParquetHiveCompatibilitySuite extends ParquetCompatibilityTest with TestHiveSingleton `




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    @ueshin updated per your feedback.
    
    I should have explained that the last update *did* handle partitioned tables (it added the second call to `getStorageTzOptions` in `HiveMetastoreCatalog`), though I didn't have any tests for it.  It took me a while to figure out how to do it, but this update does include tests for creating partitioned tables and reading from them.  (The tests are becoming huge, but I think it's worth testing all of the permutations.)




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75686/
    Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74542/
    Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74550/
    Test PASSed.




[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    **[Test build #76391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76391/testReport)** for PR 16781 at commit [`fc17a2e`](https://github.com/apache/spark/commit/fc17a2edc8e46dd95019f8ece972610d40f3ed2c).




[GitHub] spark issue #16781: [SPARK-12297][SQL][POC] Hive compatibility for Parquet T...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16781
  
    Merged build finished. Test PASSed.

