Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/02/01 01:21:10 UTC

[GitHub] [hudi] zuyanton opened a new issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

zuyanton opened a new issue #2509:
URL: https://github.com/apache/hudi/issues/2509


   **Describe the problem you faced**
   
   It looks like a column of type org.apache.spark.sql.types.TimestampType gets converted to bigint when saved to a Hudi table.
   
   **To Reproduce**
   
   create dataframe with TimestampType  
   ```
   var seq = Seq((1, "2020-01-01 11:22:30", 2, 2))
   var df = seq.toDF("pk", "time_string" , "partition", "sort_key")
   df= df.withColumn("timestamp", col("time_string").cast(TimestampType))
   ```  
   preview dataframe 
   ```
   df.show
   ```
   ```
   +---+-------------------+---------+--------+-------------------+
   | pk|        time_string|partition|sort_key|          timestamp|
   +---+-------------------+---------+--------+-------------------+
   |  1|2020-01-01 11:22:30|        2|       2|2020-01-01 11:22:30|
   +---+-------------------+---------+--------+-------------------+
   ```
   save dataframe to hudi table 
   ```
   df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://location")
   ```  
   view hudi table   
   ```
   spark.sql("select * from testTable2").show
   ```
   result, timestamp column is bigint   
   ```
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-------------------+--------+----------------+---------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| pk|        time_string|sort_key|       timestamp|partition|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-------------------+--------+----------------+---------+
   |     20210201004527|  20210201004527_0_1|              pk:1|                     2|2972ef96-279b-438...|  1|2020-01-01 11:22:30|       2|1577877750000000|        2|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-------------------+--------+----------------+---------+
   ```
   view schema 
   ```
   spark.sql("describe testTable2").show
   ```
   result
   ```
   +--------------------+---------+-------+
   |            col_name|data_type|comment|
   +--------------------+---------+-------+
   | _hoodie_commit_time|   string|   null|
   |_hoodie_commit_seqno|   string|   null|
   |  _hoodie_record_key|   string|   null|
   |_hoodie_partition...|   string|   null|
   |   _hoodie_file_name|   string|   null|
   |                  pk|      int|   null|
   |         time_string|   string|   null|
   |            sort_key|      int|   null|
   |           timestamp|   bigint|   null|
   |           partition|      int|   null|
   |# Partition Infor...|         |       |
   |          # col_name|data_type|comment|
   |           partition|      int|   null|
   +--------------------+---------+-------+
   ```
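The bigint in the query result is not arbitrary: it reads as microseconds since the Unix epoch. The sketch below decodes it with `java.time` only; `microsToInstant` is a hypothetical helper name, and the match with the original value assumes the Spark session time zone was UTC, which the report does not state.

```scala
import java.time.Instant

// Value copied from the `timestamp` column of the query result above.
val micros: Long = 1577877750000000L

// Hypothetical helper: split into whole seconds and leftover microseconds,
// then scale the remainder to nanoseconds as Instant.ofEpochSecond expects.
def microsToInstant(us: Long): Instant =
  Instant.ofEpochSecond(Math.floorDiv(us, 1000000L), Math.floorMod(us, 1000000L) * 1000L)

val decoded: Instant = microsToInstant(micros)
println(decoded) // 2020-01-01T11:22:30Z
```

The decoded instant matches the `2020-01-01 11:22:30` shown by `df.show`, so the data itself is intact and only the declared column type differs.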
   
   
   **Environment Description**
   
   * Hudi version : 0.7.0
   
   * Spark version :
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   full code snippet
   ```
       import org.apache.spark.sql.functions._
       import org.apache.spark.sql.types._
       import org.apache.hudi.hive.MultiPartKeysValueExtractor
       import org.apache.hudi.QuickstartUtils._
       import scala.collection.JavaConversions._
       import org.apache.spark.sql.SaveMode
       import org.apache.hudi.DataSourceReadOptions._
       import org.apache.hudi.DataSourceWriteOptions._
       import org.apache.hudi.DataSourceWriteOptions
       import org.apache.hudi.config.HoodieWriteConfig._
       import org.apache.hudi.config.HoodieWriteConfig
       import org.apache.hudi.keygen.ComplexKeyGenerator
       import org.apache.hudi.common.model.DefaultHoodieRecordPayload
       import org.apache.hadoop.hive.conf.HiveConf
       val hiveConf = new HiveConf()
       val hiveMetastoreURI = hiveConf.get("hive.metastore.uris").replaceAll("thrift://", "")
       val hiveServer2URI = hiveMetastoreURI.substring(0, hiveMetastoreURI.lastIndexOf(":"))
       var hudiOptions = Map[String, String](
         HoodieWriteConfig.TABLE_NAME -> "testTable2",
         "hoodie.consistency.check.enabled" -> "true",
         DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
         DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "pk",
         DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[ComplexKeyGenerator].getName,
         DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "partition",
         DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "sort_key",
         DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
         DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "testTable2",
         DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "partition",
         DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
         DataSourceWriteOptions.HIVE_URL_OPT_KEY -> s"jdbc:hive2://$hiveServer2URI:10000",
         "hoodie.payload.ordering.field" -> "sort_key",
         DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY -> classOf[DefaultHoodieRecordPayload].getName
       )
   
   //spark.sql("drop table if exists testTable1")
   var seq = Seq((1, "2020-01-01 11:22:30", 2, 2))
   var df = seq.toDF("pk", "time_string" , "partition", "sort_key")
   df= df.withColumn("timestamp", col("time_string").cast(TimestampType))
   df.show
   df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://location")
   spark.sql("select * from testTable2").show
   ```
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] rubenssoto commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-775173745


   Hi @vinothchandar @satishkotha @zuyanton,
   I ran some tests with Redshift Spectrum and Athena. Redshift Spectrum worked very well, but Athena did not; see the attached image.
   <img width="1679" alt="Captura de Tela 2021-02-08 às 11 02 07" src="https://user-images.githubusercontent.com/36298331/107230012-475f4380-69fd-11eb-8ca3-b11b4ad7d9b7.png">
   
   Is there any workaround? I opened an AWS ticket, but it will probably take a while because of the difference in Presto versions.
   
   I have some tables in regular Parquet with timestamp fields, and they work. What is the difference compared to Hudi?
   
   thank you





[GitHub] [hudi] rubenssoto commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-772108742


   Great to know. I will test this feature in Athena and Redshift Spectrum; if someone has already run this test, please let me know.
   
   





[GitHub] [hudi] vinothchandar commented on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-855194302


   Just reading through this again. We definitely need to understand whether this is an issue even when Spark is the only engine (i.e. no registration with HMS), and whether parquet-avro is the problem child.
   Running this with the row writer enabled is a good way to quickly rule that out.
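A minimal sketch of what "row writer enabled" could look like in the writer options. It assumes the option key `hoodie.datasource.write.row.writer.enable` and that the row writer only takes effect for the `bulk_insert` operation; neither is stated in this thread, so check both against the configuration reference for your Hudi version. `withRowWriter` is a hypothetical helper name.

```scala
// Hypothetical helper: returns a copy of the writer options with the row writer switched on.
// Assumption: the row writer path is only used for bulk_insert, so the operation is set as well.
def withRowWriter(options: Map[String, String]): Map[String, String] =
  options ++ Map(
    "hoodie.datasource.write.operation" -> "bulk_insert",
    "hoodie.datasource.write.row.writer.enable" -> "true"
  )

// Stand-in for the hudiOptions map from the original report.
val rowWriterOptions = withRowWriter(Map("hoodie.datasource.write.recordkey.field" -> "pk"))
rowWriterOptions.foreach { case (k, v) => println(s"$k=$v") }
```

Writing the same DataFrame once with and once without these options, then comparing the Parquet schema of the output files, would show whether the parquet-avro path is what changes the type.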





[GitHub] [hudi] satishkotha edited a comment on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
satishkotha edited a comment on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-772787278


   @zuyanton Yes, as I mentioned earlier, some changes are needed in query engines. Refer to [this](https://github.com/prestodb/presto/pull/15074) change in Presto, for example. See [this ticket](https://issues.apache.org/jira/browse/HIVE-21215) for how this is fixed upstream in Hive. You likely need to port this change to your Hive deployment to make this work (or you could upgrade to Hive 4).





[GitHub] [hudi] arpanrkl7 commented on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
arpanrkl7 commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-1008567437


   When I try to read using spark-sql, I get the error below, the same one mentioned by @zuyanton:
    java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable





[GitHub] [hudi] satishkotha edited a comment on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
satishkotha edited a comment on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-772020119


   Hi
   
   If you set the support_timestamp property mentioned [here](https://hudi.apache.org/docs/configurations.html#HIVE_SUPPORT_TIMESTAMP), Hudi will convert the field to timestamp type in Hive. 
   
   Note that the Hive/Presto/Athena query engines will need some more changes to interpret the field correctly as a timestamp. Refer to [this](https://github.com/prestodb/presto/pull/15074) change in Presto, for example. We made similar changes in our internal Hive deployment.
   
   Some more background: Hudi uses the parquet-avro module, which converts timestamps to INT64 with logical type TIMESTAMP_MICROS. Hive and other query engines expect timestamps in INT96 format, but INT96 is no longer supported. The recommended path forward is to deprecate INT96 and change query engines to work with the INT64 type; https://issues.apache.org/jira/browse/PARQUET-1883 has additional details.
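For reference, a sketch of adding that property to the writer options. The full key `hoodie.datasource.hive_sync.support_timestamp` is the one quoted elsewhere in this thread; the two other keys are assumed to be the string forms of `HIVE_SYNC_ENABLED_OPT_KEY` and `HIVE_TABLE_OPT_KEY` used in the original report, and `baseOptions` is only a stand-in for the reporter's `hudiOptions`.

```scala
// Stand-in for the existing Hive sync options (key names assumed, see above).
val baseOptions: Map[String, String] = Map(
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.hive_sync.table" -> "testTable2"
)

// Ask Hive sync to register TimestampType columns as timestamp instead of bigint.
val withTimestampSync: Map[String, String] =
  baseOptions + ("hoodie.datasource.hive_sync.support_timestamp" -> "true")

withTimestampSync.foreach { case (k, v) => println(s"$k=$v") }
```

This only changes the type registered in the metastore. The Parquet files still hold INT64 microseconds, so the query engine must also be able to read that encoding.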





[GitHub] [hudi] Gatsby-Lee commented on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-984328231


   AWS Glue3 
   + Spark: 3.1.1-amzn-0
   + Hive: 2.3.7-amzn-4
   + Hudi: 0.9
   
   I had this issue.
   Although I could see the timestamp type, the type I saw through AWS Athena was bigint.
   
   I was able to handle this issue by setting this value when inserting data:
   "hoodie.datasource.hive_sync.support_timestamp": "true"
   
   However, I am not sure whether there is any downside to setting this value to true.





[GitHub] [hudi] Gatsby-Lee edited a comment on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee edited a comment on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-1008552876


   @nsivabalan  
   After I got your message, I queried the RT table. It still fails.
   I heard from AWS that the fix will be shipped soon.





[GitHub] [hudi] Gatsby-Lee commented on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-1008552876


   @nsivabalan  
   After I got your message, I queried the RT table. It still fails.
   I heard from AWS that the fix will be shipped at the end of Jan 2022.





[GitHub] [hudi] rubenssoto commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-777679071


   @vinothchandar @satishkotha @zuyanton 
   
   I think the only workaround here is to convert the timestamp column to a string. Do you have any better ideas?
   My source timestamp column is not in microseconds, but the timestamp in the Hudi Avro schema is. Does that make sense?
   
   thank you.
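The string workaround mentioned here could be sketched as follows, assuming Spark is on the classpath. `timestampAsString` is a hypothetical helper, the local `SparkSession` exists only to make the example self-contained, and the output pattern is an arbitrary choice.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, date_format}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("timestamp-as-string").getOrCreate()
import spark.implicits._

// Same shape as the DataFrame in the original report.
val df = Seq((1, "2020-01-01 11:22:30")).toDF("pk", "time_string")
  .withColumn("timestamp", col("time_string").cast("timestamp"))

// Hypothetical helper: replace a timestamp column with its formatted string form before writing.
def timestampAsString(input: DataFrame, column: String): DataFrame =
  input.withColumn(column, date_format(col(column), "yyyy-MM-dd HH:mm:ss.SSS"))

timestampAsString(df, "timestamp").printSchema()
```

The trade-off is that the column loses its type, so every consumer has to cast it back to a timestamp at query time.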





[GitHub] [hudi] rubenssoto commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-780850528


   @nsivabalan it worked, but I don't think a view is a good solution because it creates a maintenance problem.
   
   This is not Hudi's fault, so we need to wait for Athena, but I don't think it will be solved soon...
   
   Is there anything we can do on the Hudi side? My timestamp is in milliseconds, not microseconds.





[GitHub] [hudi] satishkotha edited a comment on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
satishkotha edited a comment on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-772020119


   Hi
   
   If you set the support_timestamp property mentioned [here](https://hudi.apache.org/docs/configurations.html#HIVE_SUPPORT_TIMESTAMP), Hudi will convert the field to timestamp type. 
   
   Note that the Hive/Presto/Athena query engines will need some more changes to interpret the field correctly as a timestamp. Refer to [this](https://github.com/prestodb/presto/pull/15074) change in Presto, for example. We made similar changes in our internal Hive deployment.
   
   Some more background: Hudi uses the parquet-avro module, which converts timestamps to INT64 with logical type TIMESTAMP_MICROS. Hive and other query engines expect timestamps in INT96 format, but INT96 is no longer supported. The recommended path forward is to deprecate INT96 and change query engines to work with the INT64 type; https://issues.apache.org/jira/browse/PARQUET-1883 has additional details.





[GitHub] [hudi] zuyanton commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
zuyanton commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-772225257


   @satishkotha I added that parameter to my example. Now, after writing data to S3, when I run ```spark.sql("describe testTable3").show``` I get 
   ```
   +--------------------+---------+-------+  
   |            col_name|data_type|comment|
   +--------------------+---------+-------+
   ....
   |           timestamp|timestamp|   null|
   ```  
   which is good. However, when I run ```spark.sql("select * from  testTable3").show``` I get an exception:    
   java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
   











[GitHub] [hudi] nsivabalan commented on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-1008551882


   @umehrot2 @zhedoubushishi : Do you folks have any pointers on this? 
   @Gatsby-Lee : I believe Athena added support for real-time queries in one of its latest versions. Did you try the latest Athena? 





[GitHub] [hudi] rubenssoto commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-772793891


   @satishkotha could you help me explain to AWS support which fixes should be applied to Athena?
   
   @umehrot2 do you know if anything should be changed on EMR?
   
   thank you





[GitHub] [hudi] nsivabalan commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-822571591


   @rubenssoto : just in case you haven't seen it, https://github.com/apache/hudi/issues/2544 discusses timestamps and Hive. 





[GitHub] [hudi] vinothchandar commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-774420151


   Going back to @zuyanton 's point, that is still from Spark. And are you suggesting that Spark's Hive version needs to also pick up the change? (that sounds painful)





[GitHub] [hudi] nsivabalan commented on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-997565848


   @Gatsby-Lee : hoodie.datasource.hive_sync.support_timestamp is the right way to go. 
   
   @rubenssoto : is everything resolved on your end, or are you still having issues? Let us know. If things are resolved, feel free to close out the issue. 
   





[GitHub] [hudi] Gatsby-Lee commented on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-997673831


   @nsivabalan 
   
   Thank you for your comment.
   I am using "hoodie.datasource.hive_sync.support_timestamp"
   
   BTW, AWS Athena fails to read the MoR Realtime table (the Read Optimized table is OK).
   I found some articles that say this is related to the query engine (in this case, the managed Presto),
   so I created a support ticket with AWS.
   
   Is there any input you want me to provide to the AWS Athena team?





[GitHub] [hudi] satishkotha edited a comment on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
satishkotha edited a comment on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-772020119


   Hi
   
   If you set the support_timestamp property mentioned [here](https://hudi.apache.org/docs/configurations.html#HIVE_SUPPORT_TIMESTAMP), Hudi will convert the field to timestamp type in Hive. 
   
   Note that you need to verify that this is compatible with the Hive/Presto/Athena versions you are using. We made some changes to interpret the field correctly as a timestamp. Refer to [this](https://github.com/prestodb/presto/pull/15074) change in Presto, for example. We made similar changes in our internal Hive deployment.
   
   Some more background: Hudi uses the parquet-avro module, which converts timestamps to INT64 with logical type TIMESTAMP_MICROS. Hive and other query engines expect timestamps in INT96 format, but INT96 is no longer supported. The recommended path forward is to deprecate INT96 and change query engines to work with the INT64 type; https://issues.apache.org/jira/browse/PARQUET-1883 has additional details.











[GitHub] [hudi] satishkotha commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
satishkotha commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-772787278


   @zuyanton Yes, as I mentioned earlier, some changes are needed in query engines. Refer to [this](https://github.com/prestodb/presto/pull/15074) change in Presto, for example. See [this ticket](https://issues.apache.org/jira/browse/HIVE-21215) for how this is fixed upstream in Hive. You likely need to port this change to your Hive deployment to make this work.





[GitHub] [hudi] rubenssoto commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-816812025


   Hello guys,
   
   Athena's behavior has changed:
   
   <img width="1677" alt="Captura de Tela 2021-04-08 às 22 55 24" src="https://user-images.githubusercontent.com/36298331/114213658-a841c400-9939-11eb-9fc9-a2e51761908e.png">
   <img width="1254" alt="Captura de Tela 2021-04-08 às 22 58 09" src="https://user-images.githubusercontent.com/36298331/114213672-ad067800-9939-11eb-872d-fe264f97fcde.png">
   
   
   This is great news, but the BETWEEN operator doesn't work.
   
   For example, this query works:
   select count(1) FROM "order" WHERE created_date >= cast('2021-04-07 03:00:00.000' as timestamp)
   
   and this query doesn't work:
   select count(1) FROM "order" WHERE created_date between cast('2021-04-09 14:00:00.000' as timestamp) and cast('2021-04-09 15:00:00.000' as timestamp)
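Since the `>=` comparison is reported to work, one thing to try is spelling out the range as two comparisons, which standard SQL defines as equivalent to BETWEEN. The sketch below only builds the query text; `rangePredicate` is a hypothetical helper, and whether this form avoids the Athena failure is an untested assumption.

```scala
// Hypothetical helper: express "column BETWEEN lower AND upper" as two explicit comparisons.
// Intended for trusted literal values only; it does no escaping.
def rangePredicate(column: String, lower: String, upper: String): String =
  s"$column >= cast('$lower' as timestamp) AND $column <= cast('$upper' as timestamp)"

val query: String = "select count(1) FROM \"order\" WHERE " +
  rangePredicate("created_date", "2021-04-09 14:00:00.000", "2021-04-09 15:00:00.000")

println(query)
```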





[GitHub] [hudi] satishkotha commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
satishkotha commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-772798888


   @rubenssoto AFAIK, Athena is built on top of Presto, so you could ask them to apply the Presto change above. You can say it is needed to interpret Parquet INT64 timestamps correctly.
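   
   To illustrate what interpreting a Parquet INT64 timestamp involves, here is a minimal Scala sketch. It assumes the column stores microseconds since the Unix epoch in UTC, which is consistent with the value 1577877750000000 shown in the original report. `microsToInstant` is a hypothetical helper name used only for illustration; it is not part of Hudi, Presto, or Athena.
   
   ```scala
   import java.time.Instant
   import java.time.temporal.ChronoUnit

   // Hypothetical helper: decode an INT64 value as microseconds since
   // 1970-01-01T00:00:00Z (assumption: the Parquet column uses micros).
   def microsToInstant(micros: Long): Instant =
     Instant.EPOCH.plus(micros, ChronoUnit.MICROS)

   // The bigint value from the report decodes back to the original timestamp.
   println(microsToInstant(1577877750000000L)) // prints 2020-01-01T11:22:30Z
   ```
   
   A query engine without the referenced fix would show the raw long value instead of applying this kind of conversion.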





[GitHub] [hudi] satishkotha commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
satishkotha commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-781696815


   I think that if you query using the Spark datasource APIs, the timestamp field will be read correctly. For querying through Athena, I don't think there is another workaround, unfortunately.





[GitHub] [hudi] nsivabalan commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-779369612


   @rubenssoto : Here is a link to suggestions from Athena support on timestamp conversion.
   https://github.com/apache/hudi/issues/2123#issuecomment-778464849
   





[GitHub] [hudi] n3nash commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-771414366


   @satishkotha Could you take a look at this one?





[GitHub] [hudi] rubenssoto edited a comment on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
rubenssoto edited a comment on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-816812025


   Hello Guys,
   
   @satishkotha  @nsivabalan
   
   Athena's behavior has changed:
   
   <img width="1677" alt="Captura de Tela 2021-04-08 às 22 55 24" src="https://user-images.githubusercontent.com/36298331/114213658-a841c400-9939-11eb-9fc9-a2e51761908e.png">
   <img width="1254" alt="Captura de Tela 2021-04-08 às 22 58 09" src="https://user-images.githubusercontent.com/36298331/114213672-ad067800-9939-11eb-872d-fe264f97fcde.png">
   
   
   This is great news, but the BETWEEN operator doesn't work.
   
   For example, this query works:
   select count(1) FROM "order" WHERE created_date >= cast('2021-04-07 03:00:00.000' as timestamp)
   
   and this query doesn't work:
   select count(1) FROM "order" WHERE created_date between cast('2021-04-09 14:00:00.000' as timestamp) and cast('2021-04-09 15:00:00.000' as timestamp)





[GitHub] [hudi] vinothchandar commented on issue #2509: [SUPPORT]Hudi saves TimestampType as bigInt

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-774420007


   Thanks for jumping in @satishkotha 

