Posted to commits@hudi.apache.org by "Danny Chen (Jira)" <ji...@apache.org> on 2021/11/10 07:43:00 UTC

[jira] [Commented] (HUDI-2086) redo the logical of mor_incremental_view for hive

    [ https://issues.apache.org/jira/browse/HUDI-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441555#comment-17441555 ] 

Danny Chen commented on HUDI-2086:
----------------------------------

Fixed via master branch: a40ac62e0ce7ec807a00803d23ed223d7c607459

> redo the logical of mor_incremental_view for hive
> -------------------------------------------------
>
>                 Key: HUDI-2086
>                 URL: https://issues.apache.org/jira/browse/HUDI-2086
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Hive Integration
>    Affects Versions: 0.9.0
>         Environment: spark3.1.1
> hive3.1.1
> hadoop3.1.1
> os: suse
>            Reporter: tao meng
>            Assignee: tao meng
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>
> Now, there are some problems with mor_incremental_view for Hive.
> For example:
> 1): *Hudi cannot read the latest incremental data which is stored in log files*
> Consider this: create a MOR table with bulk_insert, then do an upsert on the same table.
> Now we want to query the latest incremental data with Hive/Spark SQL; however, the latest incremental data is stored in log files, so the query returns nothing.
> step1: prepare data
> val df = spark.sparkContext.parallelize(0 to 20, 2).map(x => testCase(x, x+"jack", Random.nextInt(2))).toDF()
>  .withColumn("col3", expr("keyid + 3000"))
>  .withColumn("p", lit(1))
> step2: do bulk_insert
> mergePartitionTable(df, 4, "default", "inc", tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
> step3: do upsert
> mergePartitionTable(df, 4, "default", "inc", tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
> step4: check the latest commit time and run the incremental query
> spark.sql("set hoodie.inc.consume.mode=INCREMENTAL")
> spark.sql("set hoodie.inc.consume.max.commits=1")
> spark.sql("set hoodie.inc.consume.start.timestamp=20210628103935")
> spark.sql("select keyid, col3 from inc_rt where `_hoodie_commit_time` > '20210628103935' order by keyid").show(100, false)
> +-----+----+
> |keyid|col3|
> +-----+----+
> +-----+----+
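For contrast, the reporter notes later in this description that the Spark DataSource incremental query does return the expected rows. A minimal sketch of the equivalent read options follows; the option keys are assumptions based on the Hudi 0.9.x DataSource API, not taken from this issue:

```scala
// Sketch only: the Spark DataSource incremental read path, which returns the
// log-file data correctly. Option keys are assumed from the 0.9.x API.
val incrementalReadOpts = Map(
  "hoodie.datasource.query.type" -> "incremental",
  "hoodie.datasource.read.begin.instanttime" -> "20210628103935"
)

// Usage (requires a SparkSession and the table base path, not shown here):
// spark.read.format("hudi").options(incrementalReadOpts).load(basePath).show()
```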
>  
> 2): *If we do insert_overwrite/insert_overwrite_table on a Hudi MOR table, the incremental query result is wrong when we want to query the data written before the insert_overwrite/insert_overwrite_table*
> step1: do bulk_insert 
> mergePartitionTable(df, 4, "default", "overInc", tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
> now the commit timeline is:
> [20210628160614.deltacommit]
> step2: do insert_overwrite_table
> mergePartitionTable(df, 4, "default", "overInc", tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert_overwrite_table")
> now the commit timeline is:
> [20210628160614.deltacommit, 20210628160923.replacecommit]
> step3: query the data before insert_overwrite_table
> spark.sql("set hoodie.overInc.consume.mode=INCREMENTAL")
> spark.sql("set hoodie.overInc.consume.max.commits=1")
> spark.sql("set hoodie.overInc.consume.start.timestamp=0")
> spark.sql("select keyid, col3 from overInc_rt where `_hoodie_commit_time` > '0' order by keyid").show(100, false)
> +-----+----+
> |keyid|col3|
> +-----+----+
> +-----+----+
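The empty result can be reasoned about with a simplified model of the timeline. The code below is a hypothetical illustration, not Hudi's actual implementation: with start.timestamp=0 and max.commits=1, the selected instant is the deltacommit, but its file slices were replaced by the later replacecommit, so a reader that drops replaced file groups returns nothing.

```scala
// Hypothetical model of incremental instant selection (illustration only,
// not Hudi's real classes or code).
case class Instant(timestamp: String, action: String)

val timeline = Seq(
  Instant("20210628160614", "deltacommit"),
  Instant("20210628160923", "replacecommit")
)

// Pick instants strictly after startTs, capped at maxCommits.
def selectIncremental(tl: Seq[Instant], startTs: String, maxCommits: Int): Seq[Instant] =
  tl.filter(_.timestamp > startTs).take(maxCommits)

val picked = selectIncremental(timeline, "0", 1)
// picked contains only the deltacommit; its file slices were replaced by the
// replacecommit, so a reader that skips replaced file groups sees no data.
```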
>  
> 3): *Hive/Presto/Flink cannot read file groups which contain only log files*
> When we use the HBase or in-memory index, a MOR table produces log files instead of parquet base files; Hive/Presto currently cannot read those file groups because they contain only log files.
> *HUDI-2048* mentions this problem.
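Why a log-only file group is invisible to Hive can be sketched with a toy model (hypothetical types, not Hudi's real classes): a reader that enumerates base files simply has nothing to open.

```scala
// Toy model of a file slice (hypothetical, for illustration): with the
// HBase/in-memory index, an upsert can produce a slice with log files only.
case class FileSlice(baseFile: Option[String], logFiles: Seq[String])

val slices = Seq(
  FileSlice(None, Seq(".fg1_20210628.log.1"))  // log-only file group, no parquet
)

// A reader keyed off base files (as the Hive input format effectively was)
// finds nothing to read in a log-only file group.
val visibleToHive = slices.filter(_.baseFile.isDefined)
```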
>  
> However, when we use the Spark DataSource to execute an incremental query, none of the above problems occur. Keeping the logic of mor_incremental_view for Hive consistent with the Spark DataSource logic is necessary.
> We redo the logic of mor_incremental_view for Hive to solve the above problems and keep it consistent with the Spark DataSource.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)