You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/11/09 01:38:28 UTC

[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19702

    [SPARK-10365][SQL] Support Parquet logical type TIMESTAMP_MICROS

    ## What changes were proposed in this pull request?
    
    This PR makes Spark to be able to read Parquet TIMESTAMP_MICROS values, and add a new config to allow Spark to write timestamp values to parquet as TIMESTAMP_MICROS type.
    
    ## How was this patch tested?
    
    new test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark parquet

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19702.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19702
    
----
commit e2747548e196028deda6461cd5e8148eb7476172
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-11-08T21:28:35Z

    cleanup

commit 5ca8bb5904ec85c3c7bb73ab91b1004de5763627
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-11-09T01:01:13Z

    support TIMESTAMP_MICROS

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19702


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r149940418
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala ---
    @@ -982,7 +941,7 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
         binaryAsString = true,
         int96AsTimestamp = false,
         writeLegacyParquetFormat = true,
    -    int64AsTimestampMillis = true)
    +    outputTimestampType = SQLConf.ParquetOutputTimestampType.TIMESTAMP_MILLIS)
    --- End diff --
    
    Should we add a test for `TIMESTAMP_MICROS` just in case?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83728/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    **[Test build #83728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83728/testReport)** for PR 19702 at commit [`d848546`](https://github.com/apache/spark/commit/d8485466835fc62e3d268a98a950f38c408e3d73).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r150381847
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala ---
    @@ -281,4 +281,32 @@ class SQLConfSuite extends QueryTest with SharedSQLContext {
         assert(null == spark.conf.get("spark.sql.nonexistent", null))
         assert("<undefined>" == spark.conf.get("spark.sql.nonexistent", "<undefined>"))
       }
    +
    +  test("SPARK-10365: PARQUET_OUTPUT_TIMESTAMP_TYPE") {
    +    spark.sessionState.conf.clear()
    +
    +    // check default value
    +    assert(spark.sessionState.conf.parquetOutputTimestampType ==
    +      SQLConf.ParquetOutputTimestampType.INT96)
    +
    +    // PARQUET_INT64_AS_TIMESTAMP_MILLIS should be respected.
    +    spark.sessionState.conf.setConf(SQLConf.PARQUET_INT64_AS_TIMESTAMP_MILLIS, true)
    +    assert(spark.sessionState.conf.parquetOutputTimestampType ==
    +      SQLConf.ParquetOutputTimestampType.TIMESTAMP_MILLIS)
    +
    +    // PARQUET_OUTPUT_TIMESTAMP_TYPE has higher priority over PARQUET_INT64_AS_TIMESTAMP_MILLIS
    +    spark.sessionState.conf.setConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE, "timestamp_micros")
    +    assert(spark.sessionState.conf.parquetOutputTimestampType ==
    +      SQLConf.ParquetOutputTimestampType.TIMESTAMP_MICROS)
    +    spark.sessionState.conf.setConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE, "int96")
    +    assert(spark.sessionState.conf.parquetOutputTimestampType ==
    +      SQLConf.ParquetOutputTimestampType.INT96)
    +
    +    // test invalid conf value
    +    intercept[IllegalArgumentException] {
    +      spark.conf.set(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key, "invalid")
    +    }
    +
    +    spark.sessionState.conf.clear()
    --- End diff --
    
    This is the existing code style in this file, I think we can put it in `beforeEach` and `afterEach`, I'll investigate in follow-up.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Will review it tomorrow.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    **[Test build #83722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83722/testReport)** for PR 19702 at commit [`ba16a5e`](https://github.com/apache/spark/commit/ba16a5e6bf33f0a3b04026e7d4414c4cc7b960ee).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r150378414
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
    @@ -281,10 +283,11 @@ private void checkEndOfRowGroup() throws IOException {
               + rowsReturned + " out of " + totalRowCount);
         }
         List<ColumnDescriptor> columns = requestedSchema.getColumns();
    +    List<Type> types = requestedSchema.asGroupType().getFields();
         columnReaders = new VectorizedColumnReader[columns.size()];
         for (int i = 0; i < columns.size(); ++i) {
           if (missingColumns[i]) continue;
    -      columnReaders[i] = new VectorizedColumnReader(columns.get(i),
    +      columnReaders[i] = new VectorizedColumnReader(columns.get(i), types.get(i).getOriginalType(),
    --- End diff --
    
    Nit: indents.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    **[Test build #83673 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83673/testReport)** for PR 19702 at commit [`180c449`](https://github.com/apache/spark/commit/180c4497ff7ca3cb5a5c7e38ce792c76af8cf31d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r150168485
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -1143,6 +1159,18 @@ class SQLConf extends Serializable with Logging {
     
       def isParquetINT64AsTimestampMillis: Boolean = getConf(PARQUET_INT64_AS_TIMESTAMP_MILLIS)
     
    +  def parquetOutputTimestampType: ParquetOutputTimestampType.Value = {
    +    val isOutputTimestampTypeSet = settings.containsKey(PARQUET_OUTPUT_TIMESTAMP_TYPE.key)
    +    if (!isOutputTimestampTypeSet && isParquetINT64AsTimestampMillis) {
    +      // If PARQUET_OUTPUT_TIMESTAMP_TYPE is not set and PARQUET_INT64_AS_TIMESTAMP_MILLIS is set,
    +      // respect PARQUET_INT64_AS_TIMESTAMP_MILLIS and use TIMESTAMP_MILLIS. Otherwise,
    +      // PARQUET_OUTPUT_TIMESTAMP_TYPE has higher priority.
    --- End diff --
    
    if `isParquetINT64AsTimestampMillis` is false, we will go to the else branch, and pick `PARQUET_OUTPUT_TIMESTAMP_TYPE`, which by default is INT96(the current behavior). Let me add a test.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r149924096
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -428,15 +417,9 @@ object ParquetFileFormat extends Logging {
       private[parquet] def readSchema(
           footers: Seq[Footer], sparkSession: SparkSession): Option[StructType] = {
     
    -    def parseParquetSchema(schema: MessageType): StructType = {
    -      val converter = new ParquetSchemaConverter(
    -        sparkSession.sessionState.conf.isParquetBinaryAsString,
    -        sparkSession.sessionState.conf.isParquetBinaryAsString,
    -        sparkSession.sessionState.conf.writeLegacyParquetFormat,
    -        sparkSession.sessionState.conf.isParquetINT64AsTimestampMillis)
    -
    -      converter.convert(schema)
    -    }
    +    val converter = new ParquetToSparkSchemaConverter(
    +      sparkSession.sessionState.conf.isParquetBinaryAsString,
    +      sparkSession.sessionState.conf.isParquetBinaryAsString)
    --- End diff --
    
    good catch! It's an existing type ...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r150378160
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java ---
    @@ -91,11 +92,13 @@
     
       private final PageReader pageReader;
       private final ColumnDescriptor descriptor;
    +  private final OriginalType originalType;
     
    -  public VectorizedColumnReader(ColumnDescriptor descriptor, PageReader pageReader)
    +  public VectorizedColumnReader(ColumnDescriptor descriptor, OriginalType t, PageReader pageReader)
    --- End diff --
    
    `t` -> `originalType` or `logicalType`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r149847655
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala ---
    @@ -30,49 +30,31 @@ import org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter
     import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types._
     
    +
     /**
    - * This converter class is used to convert Parquet [[MessageType]] to Spark SQL [[StructType]] and
    - * vice versa.
    + * This converter class is used to convert Parquet [[MessageType]] to Spark SQL [[StructType]].
      *
      * Parquet format backwards-compatibility rules are respected when converting Parquet
      * [[MessageType]] schemas.
      *
      * @see https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
    - * @constructor
    + *
      * @param assumeBinaryIsString Whether unannotated BINARY fields should be assumed to be Spark SQL
    - *        [[StringType]] fields when converting Parquet a [[MessageType]] to Spark SQL
    - *        [[StructType]].  This argument only affects Parquet read path.
    + *        [[StringType]] fields.
      * @param assumeInt96IsTimestamp Whether unannotated INT96 fields should be assumed to be Spark SQL
    - *        [[TimestampType]] fields when converting Parquet a [[MessageType]] to Spark SQL
    - *        [[StructType]].  Note that Spark SQL [[TimestampType]] is similar to Hive timestamp, which
    - *        has optional nanosecond precision, but different from `TIME_MILLS` and `TIMESTAMP_MILLIS`
    - *        described in Parquet format spec.  This argument only affects Parquet read path.
    - * @param writeLegacyParquetFormat Whether to use legacy Parquet format compatible with Spark 1.4
    - *        and prior versions when converting a Catalyst [[StructType]] to a Parquet [[MessageType]].
    - *        When set to false, use standard format defined in parquet-format spec.  This argument only
    - *        affects Parquet write path.
    - * @param writeTimestampInMillis Whether to write timestamp values as INT64 annotated by logical
    - *        type TIMESTAMP_MILLIS.
    - *
    + *        [[TimestampType]] fields.
      */
    -private[parquet] class ParquetSchemaConverter(
    +class ParquetToSparkSchemaConverter(
    --- End diff --
    
    split it into 2 classes, to make it clear that which configs are for reading and which are for writing.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Is it available in parquet 1.8.2? that's the version Spark currently use.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    cc @liancheng  @ueshin  @gatorsmile


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r149863117
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -428,15 +417,9 @@ object ParquetFileFormat extends Logging {
       private[parquet] def readSchema(
           footers: Seq[Footer], sparkSession: SparkSession): Option[StructType] = {
     
    -    def parseParquetSchema(schema: MessageType): StructType = {
    -      val converter = new ParquetSchemaConverter(
    -        sparkSession.sessionState.conf.isParquetBinaryAsString,
    -        sparkSession.sessionState.conf.isParquetBinaryAsString,
    -        sparkSession.sessionState.conf.writeLegacyParquetFormat,
    -        sparkSession.sessionState.conf.isParquetINT64AsTimestampMillis)
    -
    -      converter.convert(schema)
    -    }
    +    val converter = new ParquetToSparkSchemaConverter(
    +      sparkSession.sessionState.conf.isParquetBinaryAsString,
    +      sparkSession.sessionState.conf.isParquetBinaryAsString)
    --- End diff --
    
    `sparkSession.sessionState.conf.isParquetINT96AsTimestamp`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    **[Test build #83673 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83673/testReport)** for PR 19702 at commit [`180c449`](https://github.com/apache/spark/commit/180c4497ff7ca3cb5a5c7e38ce792c76af8cf31d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r150131141
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -1143,6 +1159,18 @@ class SQLConf extends Serializable with Logging {
     
       def isParquetINT64AsTimestampMillis: Boolean = getConf(PARQUET_INT64_AS_TIMESTAMP_MILLIS)
     
    +  def parquetOutputTimestampType: ParquetOutputTimestampType.Value = {
    +    val isOutputTimestampTypeSet = settings.containsKey(PARQUET_OUTPUT_TIMESTAMP_TYPE.key)
    +    if (!isOutputTimestampTypeSet && isParquetINT64AsTimestampMillis) {
    +      // If PARQUET_OUTPUT_TIMESTAMP_TYPE is not set and PARQUET_INT64_AS_TIMESTAMP_MILLIS is set,
    +      // respect PARQUET_INT64_AS_TIMESTAMP_MILLIS and use TIMESTAMP_MILLIS. Otherwise,
    +      // PARQUET_OUTPUT_TIMESTAMP_TYPE has higher priority.
    --- End diff --
    
    BTW, do we have a simple test for this priority? seems `isParquetINT64AsTimestampMillis` defaults to `false`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    The only failed test is SQLConfSuite which test the default value of the new config.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    ok sorry for taking a bit of time to look at this.  tl;dr I think what you are proposing here is fine.
    
    So to be honest, I'm a bit confused about the parquet versioning -- the types I was mentioned were added here: https://github.com/apache/parquet-format/commit/863875e0be3237c6aa4ed71733d54c91a51deabe#diff-0f9d1b5347959e15259da7ba8f4b6252
    to parquet-format v2.4.0
    
    I can't really tell what version of parquet-mr that corresponds to.  I can at least see that parquet-mr v1.8.2 doesn't have the new types.  I don't even think its master.  But IIUC, the latest version of parquet-format introduces new types, independent of the old TIMESTAMP_MILLIS and TIMESTAMP_MICROS.  So there won't be any compatibility issues when `isAdjustedToUTC` is added, as that will be in an entirely new type.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by ash211 <gi...@git.apache.org>.
Github user ash211 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r151569727
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala ---
    @@ -372,23 +381,18 @@ private[parquet] class ParquetSchemaConverter(
           // `TIMESTAMP_MICROS` which are both logical types annotating `INT64`.
           //
           // Originally, Spark SQL uses the same nanosecond timestamp type as Impala and Hive.  Starting
    -      // from Spark 1.5.0, we resort to a timestamp type with 100 ns precision so that we can store
    -      // a timestamp into a `Long`.  This design decision is subject to change though, for example,
    -      // we may resort to microsecond precision in the future.
    -      //
    -      // For Parquet, we plan to write all `TimestampType` value as `TIMESTAMP_MICROS`, but it's
    -      // currently not implemented yet because parquet-mr 1.8.1 (the version we're currently using)
    -      // hasn't implemented `TIMESTAMP_MICROS` yet, however it supports TIMESTAMP_MILLIS. We will
    -      // encode timestamp values as TIMESTAMP_MILLIS annotating INT64 if
    -      // 'spark.sql.parquet.int64AsTimestampMillis' is set.
    -      //
    -      // TODO Converts `TIMESTAMP_MICROS` once parquet-mr implements that.
    -
    -      case TimestampType if writeTimestampInMillis =>
    -        Types.primitive(INT64, repetition).as(TIMESTAMP_MILLIS).named(field.name)
    -
    +      // from Spark 1.5.0, we resort to a timestamp type with microsecond precision so that we can
    --- End diff --
    
    I think this comment change loses some historical context about behavior around 1.5.0.  From 1.5.0 to 2.2.0 timestamps were with 100ns precision, and now with this change, starting in 2.3.0 they are microsecond precision, right?
    
    The change as written seems to retroactively change the described behavior, unless comment was wrong


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r151570606
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala ---
    @@ -372,23 +381,18 @@ private[parquet] class ParquetSchemaConverter(
           // `TIMESTAMP_MICROS` which are both logical types annotating `INT64`.
           //
           // Originally, Spark SQL uses the same nanosecond timestamp type as Impala and Hive.  Starting
    -      // from Spark 1.5.0, we resort to a timestamp type with 100 ns precision so that we can store
    -      // a timestamp into a `Long`.  This design decision is subject to change though, for example,
    -      // we may resort to microsecond precision in the future.
    -      //
    -      // For Parquet, we plan to write all `TimestampType` value as `TIMESTAMP_MICROS`, but it's
    -      // currently not implemented yet because parquet-mr 1.8.1 (the version we're currently using)
    -      // hasn't implemented `TIMESTAMP_MICROS` yet, however it supports TIMESTAMP_MILLIS. We will
    -      // encode timestamp values as TIMESTAMP_MILLIS annotating INT64 if
    -      // 'spark.sql.parquet.int64AsTimestampMillis' is set.
    -      //
    -      // TODO Converts `TIMESTAMP_MICROS` once parquet-mr implements that.
    -
    -      case TimestampType if writeTimestampInMillis =>
    -        Types.primitive(INT64, repetition).as(TIMESTAMP_MILLIS).named(field.name)
    -
    +      // from Spark 1.5.0, we resort to a timestamp type with microsecond precision so that we can
    --- End diff --
    
    I don't think Spark ever have 100ns precision, cc @liancheng to confirm.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83722/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    hey thanks for doing this @cloud-fan but I have a small request -- can we get another day to review how this works, especially in connection with somewhat recent changes in parquet to include a [`isAdjustedToUTC`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L271)?  Just want to make sure this doesn't cause problems with resolving with / without time zone in parquet data later on.  (don't think it should, just want to take a bit closer look)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83641/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    **[Test build #83728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83728/testReport)** for PR 19702 at commit [`d848546`](https://github.com/apache/spark/commit/d8485466835fc62e3d268a98a950f38c408e3d73).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    **[Test build #83641 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83641/testReport)** for PR 19702 at commit [`af62d30`](https://github.com/apache/spark/commit/af62d301ee9d2f3f9ed0a5797110b6388b78f3e6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r150378568
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala ---
    @@ -260,7 +267,6 @@ private[parquet] class ParquetRowConverter(
             }
     
           case TimestampType =>
    --- End diff --
    
    Write a comment here to explain it is for Int96?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83673/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    **[Test build #83722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83722/testReport)** for PR 19702 at commit [`ba16a5e`](https://github.com/apache/spark/commit/ba16a5e6bf33f0a3b04026e7d4414c4cc7b960ee).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83643/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    **[Test build #83643 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83643/testReport)** for PR 19702 at commit [`e10c806`](https://github.com/apache/spark/commit/e10c8062e3df5b5caa784b0c10ccd92cf56099d2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    LGTM pending tests.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    **[Test build #83616 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83616/testReport)** for PR 19702 at commit [`5ca8bb5`](https://github.com/apache/spark/commit/5ca8bb5904ec85c3c7bb73ab91b1004de5763627).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    **[Test build #83616 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83616/testReport)** for PR 19702 at commit [`5ca8bb5`](https://github.com/apache/spark/commit/5ca8bb5904ec85c3c7bb73ab91b1004de5763627).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    LGTM pending Jenkins


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r150378883
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -285,8 +285,24 @@ object SQLConf {
         .booleanConf
         .createWithDefault(true)
     
    +  object ParquetOutputTimestampType extends Enumeration {
    +    val INT96, TIMESTAMP_MICROS, TIMESTAMP_MILLIS = Value
    +  }
    +
    +  val PARQUET_OUTPUT_TIMESTAMP_TYPE = buildConf("spark.sql.parquet.outputTimestampType")
    +    .doc("Sets which Parquet timestamp type to use when Spark writes data to Parquet files. " +
    +      "INT96 is a non-standard but commonly used timestamp type in Parquet. TIMESTAMP_MICROS " +
    +      "is a standard timestamp type in Parquet, which stores number of microseconds from the " +
    +      "Unix epoch. TIMESTAMP_MILLIS is also standard, but with millisecond precision, which " +
    +      "means Spark has to truncate the microsecond portion of its timestamp value.")
    +    .stringConf
    +    .transform(_.toUpperCase(Locale.ROOT))
    +    .checkValues(ParquetOutputTimestampType.values.map(_.toString))
    +    .createWithDefault(ParquetOutputTimestampType.INT96.toString)
    --- End diff --
    
    To ensure no bug or test case issues, run all the test cases after changing the default? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    **[Test build #83641 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83641/testReport)** for PR 19702 at commit [`af62d30`](https://github.com/apache/spark/commit/af62d301ee9d2f3f9ed0a5797110b6388b78f3e6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19702#discussion_r150378929
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala ---
    @@ -281,4 +281,32 @@ class SQLConfSuite extends QueryTest with SharedSQLContext {
         assert(null == spark.conf.get("spark.sql.nonexistent", null))
         assert("<undefined>" == spark.conf.get("spark.sql.nonexistent", "<undefined>"))
       }
    +
    +  test("SPARK-10365: PARQUET_OUTPUT_TIMESTAMP_TYPE") {
    +    spark.sessionState.conf.clear()
    +
    +    // check default value
    +    assert(spark.sessionState.conf.parquetOutputTimestampType ==
    +      SQLConf.ParquetOutputTimestampType.INT96)
    +
    +    // PARQUET_INT64_AS_TIMESTAMP_MILLIS should be respected.
    +    spark.sessionState.conf.setConf(SQLConf.PARQUET_INT64_AS_TIMESTAMP_MILLIS, true)
    +    assert(spark.sessionState.conf.parquetOutputTimestampType ==
    +      SQLConf.ParquetOutputTimestampType.TIMESTAMP_MILLIS)
    +
    +    // PARQUET_OUTPUT_TIMESTAMP_TYPE has higher priority over PARQUET_INT64_AS_TIMESTAMP_MILLIS
    +    spark.sessionState.conf.setConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE, "timestamp_micros")
    +    assert(spark.sessionState.conf.parquetOutputTimestampType ==
    +      SQLConf.ParquetOutputTimestampType.TIMESTAMP_MICROS)
    +    spark.sessionState.conf.setConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE, "int96")
    +    assert(spark.sessionState.conf.parquetOutputTimestampType ==
    +      SQLConf.ParquetOutputTimestampType.INT96)
    +
    +    // test invalid conf value
    +    intercept[IllegalArgumentException] {
    +      spark.conf.set(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key, "invalid")
    +    }
    +
    +    spark.sessionState.conf.clear()
    --- End diff --
    
    ```
    try {
      ...
    } finally {
      spark.sessionState.conf.clear()
    }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    **[Test build #83643 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83643/testReport)** for PR 19702 at commit [`e10c806`](https://github.com/apache/spark/commit/e10c8062e3df5b5caa784b0c10ccd92cf56099d2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19702
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83616/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org