Posted to commits@hudi.apache.org by "taisenki (Jira)" <ji...@apache.org> on 2022/01/11 02:00:00 UTC

[jira] [Comment Edited] (HUDI-3204) spark on TimestampBasedKeyGenerator has no result when query by partition column

    [ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472398#comment-17472398 ] 

taisenki edited comment on HUDI-3204 at 1/11/22, 1:59 AM:
----------------------------------------------------------

[~biyan900116@gmail.com] [~shivnarayan] 

I think queries should use the input format. `TimestampBasedKeyGenerator` is only used to build the partition key and should not change the original data.

There is also a related situation:

when *Read Optimized Queries* are used to read this data (MOR table), the result comes back in the partition-path format (like '2018/09/24'), whereas the real-time query returns the original value:

```
scala> spark.time(spark.sql("select * from issue_4417_mor_ro where id = '1'").select("data_date").show);
+----------+
| data_date|
+----------+
|2018/09/23|
+----------+

Time taken: 12067 ms
scala> spark.time(spark.sql("select * from issue_4417_mor_rt where id = '1'").select("data_date").show);
+----------+
| data_date|
+----------+
|2018-09-23|
+----------+

Time taken: 30927 ms
scala>
```
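
To make the point about the key generator concrete, here is a minimal sketch of the mapping implied by the write options in the issue description: a value in the input format (`yyyy-MM-dd`) becomes a partition path in the output format (`yyyy/MM/dd`). It uses plain `java.time` and is not Hudi's implementation. `PartitionPathSketch` and `toPartitionPath` are hypothetical names, and the `GMT+8:00` timezone option is ignored here because the values are date-only.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper for illustration only; not part of the Hudi API.
object PartitionPathSketch {
  // Patterns copied from the write options in the issue description.
  val inputFormat: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
  val outputFormat: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy/MM/dd")

  // Parse a value in the input format and render it in the output format.
  def toPartitionPath(value: String): String =
    LocalDate.parse(value, inputFormat).format(outputFormat)
}
```

Under these assumptions, a predicate written in the input format, such as `data_date = '2018-09-24'`, would correspond to the partition path `2018/09/24`, while the column value itself stays `2018-09-24`.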


was (Author: taisenki):
[~biyan900116@gmail.com] [~shivnarayan] 

I think there should the input format to query ,  `TimestampBasedKeyGenerator ` just used for the partition key but not origin data. 

There also a situation such like:

when use *Read Optimized Queries* to read this data( mor data), there where result to format  '2018/09/24' :
scala> spark.time(spark.sql("select * from issue_4417_mor_ro where id = '1'").select("data_date").show);
+----------+
| data_date|
+----------+
|2018/09/23|
+----------+
Time taken: 12067 ms
scala> spark.time(spark.sql("select * from issue_4417_mor_rt where id = '1'").select("data_date").show);
+----------+
| data_date|
+----------+
|2018-09-23|
+----------+
Time taken: 30927 ms
scala>

> spark on TimestampBasedKeyGenerator has no result when query by partition column
> --------------------------------------------------------------------------------
>
>                 Key: HUDI-3204
>                 URL: https://issues.apache.org/jira/browse/HUDI-3204
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: Yann Byron
>            Assignee: Yann Byron
>            Priority: Major
>             Fix For: 0.11.0
>
>
>  
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
> import org.apache.hudi.hive.MultiPartKeysValueExtractor
> val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24")).toDF("id", "name", "age", "ts", "data_date")
> // mor
> df.write.format("hudi").
> option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
> option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
> option("hoodie.datasource.write.recordkey.field", "id").
> option("hoodie.datasource.write.partitionpath.field", "data_date").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
> option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
> option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
> mode(org.apache.spark.sql.SaveMode.Append).
> save("file:///tmp/hudi/issue_4417_mor")
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
> |  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> // cannot query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'").show
> // still cannot query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show
> // cow
> df.write.format("hudi").
> option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
> option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
> option("hoodie.datasource.write.recordkey.field", "id").
> option("hoodie.datasource.write.partitionpath.field", "data_date").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
> option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
> option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
> mode(org.apache.spark.sql.SaveMode.Append).
> save("file:///tmp/hudi/issue_4417_cow") 
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
> |  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> // cannot query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018-09-24'").show
> // but 2018/09/24 works
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018/09/24'").show
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)