Posted to commits@hudi.apache.org by "stayrascal (via GitHub)" <gi...@apache.org> on 2023/02/24 09:20:37 UTC

[GitHub] [hudi] stayrascal opened a new issue, #8038: [SUPPORT] How to query different queries via Hive 3

stayrascal opened a new issue, #8038:
URL: https://github.com/apache/hudi/issues/8038

   **Describe the problem you faced**
   
   Hi, I am using Flink 1.16 to create some tables (COW & MOR) based on HoodieCatalog and write data to them, and then I try to query the data from other engines. But I'm confused about how to run incremental queries via Hive: I didn't find any official documentation about these operations except the demo at https://github.com/apache/hudi/blob/master/docker/demo/hive-incremental-mor-rt.commands.
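   For reference, the pattern in that demo boils down to three session properties plus a commit-time filter. A minimal sketch (the properties are from the Hudi Hive querying docs; the start timestamp reuses the first commit shown in the show_commits output below):
   
   ```
   -- Hive incremental query on the MOR table, following the linked demo
   SET hoodie.flink_hudi_mor_tbl.consume.mode=INCREMENTAL;
   SET hoodie.flink_hudi_mor_tbl.consume.start.timestamp=20230216160153391;
   SET hoodie.flink_hudi_mor_tbl.consume.max.commits=3;
   SELECT `_hoodie_commit_time`, uuid, name, age
   FROM flink_hudi_mor_tbl_rt
   WHERE `_hoodie_commit_time` > '20230216160153391';
   ```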
   
   And I've hit two problems:
   - On a MOR table that hasn't gone through any compaction, a count aggregation on the RO table returns the expected result, but selecting all fields returns an empty result.
   - On a MOR table that has gone through some compactions, incremental queries don't seem to work.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   Flink
   1. Create Hoodie Catalog & Tables & Insert Data.
   ```
   CREATE CATALOG hms_catalog WITH (
       'type'='hudi',
       'catalog.path'='hdfs://xxxxx-1:8020/xxxx/hive',
       'hive.conf.dir'='/xxxxx/hive/conf/',
       'mode'='hms'
   );
   
   CREATE TABLE flink_hudi_mor_tbl(
     uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
     name VARCHAR(10),
     age INT,
     ts TIMESTAMP(3),
     `partition` VARCHAR(20)
   )
   PARTITIONED BY (`partition`)
   WITH (
     'connector' = 'hudi',
     'table.type' = 'MERGE_ON_READ',
     'hoodie.datasource.write.recordkey.field' = 'uuid',
     'precombine.field' = 'ts',
     'hive_sync.enabled' = 'true'
   );
   
   CREATE TABLE hms_catalog.hudi_hms_db.flink_hudi_mor_streaming_tbl(
     uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
     name VARCHAR(10),
     age INT,
     ts TIMESTAMP(3),
     `partition` VARCHAR(20)
   )
   PARTITIONED BY (`partition`)
   WITH (
     'connector' = 'hudi',
     'table.type' = 'MERGE_ON_READ',
     'hoodie.datasource.write.recordkey.field' = 'uuid',
     'precombine.field' = 'ts',
     'hive_sync.enabled' = 'true'
   );
   
   CREATE TABLE flink_hudi_cow_tbl(
     uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
     name VARCHAR(10),
     age INT,
     ts TIMESTAMP(3),
     `partition` VARCHAR(20)
   )
   PARTITIONED BY (`partition`)
   WITH (
     'connector' = 'hudi',
     'table.type' = 'COPY_ON_WRITE',
     'hoodie.datasource.write.recordkey.field' = 'uuid',
     'precombine.field' = 'ts'
   );
   
   -- write twice with different uuid, and no compaction triggered since there are only 2 commits
   INSERT INTO `hms_catalog`.`hudi_hms_db`.`flink_hudi_mor_tbl` VALUES
     ('id31','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
     ('id32','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
     ('id33','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
     ('id34','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
     ('id35','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
     ('id36','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
     ('id37','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
     ('id38','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
   
   -- write to another mor table, and trigger compactions
   CREATE TABLE `default_catalog`.`default_database`.`fake_datasource` ( 
       `uuid` STRING, 
       `name` STRING,
       `age` INT,
       `ts` TIMESTAMP(3),  -- physical column; its value comes from 'fields.ts.expression' below
       `partition` VARCHAR(20)
   ) WITH (
     'connector' = 'faker', 
     'rows-per-second' = '2',
     'fields.uuid.expression' = '#{numerify ''id####''}',
     'fields.name.expression' = '#{superhero.name}',
     'fields.age.expression' = '#{number.numberBetween ''20'',''50''}',
     'fields.partition.expression' = '#{Options.option ''par1'',''par2'',''par3'',''par4''}',
     'fields.ts.expression' =  '#{date.past ''45'',''10'',''SECONDS''}'
   );
   
   INSERT INTO hms_catalog.hudi_hms_db.flink_hudi_mor_streaming_tbl SELECT * FROM `default_catalog`.`default_database`.`fake_datasource`;
   
   -- write three times with different uuids
   INSERT INTO `hms_catalog`.`hudi_hms_db`.`flink_hudi_cow_tbl` VALUES
     ('id31','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
     ('id32','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
     ('id33','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
     ('id34','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
     ('id35','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
     ('id36','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
     ('id37','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
     ('id38','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
   ```
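   (Side note on the reproduction: the Flink writer triggers async compaction on a MOR table every `compaction.delta_commits` delta commits, which defaults to 5, so the two commits above leave `flink_hudi_mor_tbl` uncompacted while the streaming table compacts over time. A sketch of how a table could opt into earlier compaction; the table name here is hypothetical:)
   
   ```
   -- Sketch: MOR table that compacts after every 2 delta commits instead of the default 5
   CREATE TABLE flink_hudi_mor_eager_tbl(   -- hypothetical name, for illustration
     uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
     name VARCHAR(10),
     age INT,
     ts TIMESTAMP(3),
     `partition` VARCHAR(20)
   )
   PARTITIONED BY (`partition`)
   WITH (
     'connector' = 'hudi',
     'table.type' = 'MERGE_ON_READ',
     'precombine.field' = 'ts',
     'compaction.async.enabled' = 'true',  -- the Flink default is already true
     'compaction.delta_commits' = '2'      -- default: 5
   );
   ```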
   2. Use Spark to show commits
   ```
   spark-sql> call show_commits(table => 'hudi_hms_db.flink_hudi_mor_tbl');
   20230216160243458        4267        0        4        4        8        8        0
   20230216160153391        4260        0        4        4        8        0        0
   Time taken: 0.084 seconds, Fetched 2 row(s)
   
   spark-sql> call show_commits(table => 'hudi_hms_db.flink_hudi_cow_tbl');
   20230219145900116        1741955        0        4        4        24        0        0
   20230216154007116        1741814        0        4        4        16        0        0
   20230216154001168        1741175        4        0        4        8        0        0
   Time taken: 0.44 seconds, Fetched 3 row(s)
   
   spark-sql> call show_commits(table => 'hudi_hms_db.flink_hudi_mor_streaming_tbl');
   20230216222718023        16248        0        4        4        60        60        0
   20230216222648287        16137        0        4        4        60        60        0
   20230216222617980        16256        0        4        4        60        60        0
   20230216222548075        16363        0        4        4        60        60        0
   20230216222548025        1917272      0        4        4        8163      71        0
   20230216222518018        16266        0        4        4        60        60        0
   20230216222448002        16384        0        4        4        60        60        0
   20230216222418042        16273        0        4        4        60        60        0
   20230216222347982        16307        0        4        4        60        60        0
   20230216222318041        16275        0        4        4        60        60        0
   Time taken: 0.533 seconds, Fetched 10 row(s)
   ``` 
   3. Problem 1: selecting all records returns an empty result, while the count aggregation works.
   ```
   0: jdbc:hive2://xxxx-1:10000/> set hive.vectorized.execution.enabled=false;
   0: jdbc:hive2://xxxx-1:10000/> select count(*) from flink_hudi_mor_tbl_rt;
   +------+
   | _c0  |
   +------+
   | 16   |
   +------+
   1 row selected (8.158 seconds)
   0: jdbc:hive2://xxxx-1:10000/> select * from flink_hudi_mor_tbl_rt;
   +--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+------------------------------------------+-----------------------------+-----------------------------+----------------------------+---------------------------+----------------------------------+
   | flink_hudi_mor_tbl_rt._hoodie_commit_time  | flink_hudi_mor_tbl_rt._hoodie_commit_seqno  | flink_hudi_mor_tbl_rt._hoodie_record_key  | flink_hudi_mor_tbl_rt._hoodie_partition_path  | flink_hudi_mor_tbl_rt._hoodie_file_name  | flink_hudi_mor_tbl_rt.uuid  | flink_hudi_mor_tbl_rt.name  | flink_hudi_mor_tbl_rt.age  | flink_hudi_mor_tbl_rt.ts  | flink_hudi_mor_tbl_rt.partition  |
   +--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+------------------------------------------+-----------------------------+-----------------------------+----------------------------+---------------------------+----------------------------------+
   +--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+------------------------------------------+-----------------------------+-----------------------------+----------------------------+---------------------------+----------------------------------+
   No rows selected (0.143 seconds)
   ```
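   A hedged aside on this symptom, based on the Hudi querying docs rather than a verified fix: Hive's input format needs to be pointed at Hudi's classes for MOR snapshot (`_rt`) reads, otherwise the log files may not be read correctly. The documented session settings:
   
   ```
   -- Per the Hudi docs, for MOR snapshot (_rt) queries in Hive:
   SET hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;  -- MR engine
   -- SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;           -- Tez engine
   SELECT * FROM flink_hudi_mor_tbl_rt;
   ```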
   
   **Expected behavior**
   
   Querying all fields of the MOR RT table via Hive should return the 16 written records, and incremental queries should return only the records written after the given commit time, for both COW and MOR tables.
   
   **Environment Description**
   
   * Hudi version : 0.12.2
   * Spark version : 5.2.1
   * Hive version : 3.1.2
   * Hadoop version : 3.3.4
   * Storage (HDFS/S3/GCS..) : HDFS
   * Running on Docker? (yes/no) : No
   
   
   




[GitHub] [hudi] danny0405 commented on issue #8038: [SUPPORT] How to query different queries via Hive 3 & Trino

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8038:
URL: https://github.com/apache/hudi/issues/8038#issuecomment-1445868438

   Support for the Trino MOR table type will land in release 0.14.0. For the Hive3 query, did you modify some code in Hive to support that? Here is a document about Hive & Flink: https://www.yuque.com/docs/share/879349ce-7de4-4284-9126-9c2a3c93a91d?#%20%E3%80%8AHive%20On%20Hudi%E3%80%8B
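   (For context, a common way to wire Hudi into a stock Hive 3 session without patching Hive itself is to register the Hudi Hive bundle per session; the jar path below is an assumption for illustration:)
   
   ```
   -- Sketch: register the Hudi Hive bundle for this session (path is hypothetical)
   ADD JAR hdfs:///user/hive/auxlib/hudi-hadoop-mr-bundle-0.12.2.jar;
   SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
   ```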




[GitHub] [hudi] stayrascal commented on issue #8038: [SUPPORT] How to query different queries via Hive 3 & Trino

Posted by "stayrascal (via GitHub)" <gi...@apache.org>.
stayrascal commented on issue #8038:
URL: https://github.com/apache/hudi/issues/8038#issuecomment-1443400947

   And in the Trino case, Trino cannot count all records based on the RT table.
   
   ```
   trino> select count(*) from hive.hudi_hms_db.flink_hudi_mor_streaming_tbl_rt;
    _col0
   -------
     8163
   (1 row)
   
   Query 20230224_100730_00008_sxxri, FINISHED, 2 nodes
   Splits: 21 total, 21 done (100.00%)
   0.65 [8.16K rows, 1.72MB] [12.6K rows/s, 2.66MB/s]
   
   trino> select count(*) from hive.hudi_hms_db.flink_hudi_mor_streaming_tbl_ro;
    _col0
   -------
     8163
   (1 row)
   
   Query 20230224_100735_00009_sxxri, FINISHED, 2 nodes
   Splits: 21 total, 21 done (100.00%)
   0.61 [8.16K rows, 1.72MB] [13.3K rows/s, 2.8MB/s]
   
   trino> select count(*) from hive.hudi_hms_db.flink_hudi_mor_streaming_tbl;
    _col0
   -------
        0
   (1 row)
   
   Query 20230224_100738_00010_sxxri, FINISHED, 2 nodes
   Splits: 18 total, 18 done (100.00%)
   0.56 [0 rows, 0B] [0 rows/s, 0B/s]
   ```
   
   And if the MOR table hasn't gone through any compaction, a query on the RT table throws a "Not valid Parquet file" exception. Is this expected behavior?
   
   ```
   trino> select * from hive.hudi_hms_db.flink_hudi_mor_tbl_rt;
   
   Query 20230224_100913_00011_sxxri, FAILED, 2 nodes
   Splits: 4 total, 0 done (0.00%)
   0.51 [0 rows, 0B] [0 rows/s, 0B/s]
   
   Query 20230224_100913_00011_sxxri failed: Not valid Parquet file: hdfs://xxxxxxx/hive/hudi_hms_db/flink_hudi_mor_tbl/par3/.83b4db58-a84b-40b5-b38d-d79acfa8db3c_20230216160153391.log.1_0-1-0 expected magic number: PAR1 got: #
   ```




[GitHub] [hudi] stayrascal commented on issue #8038: [SUPPORT] How to query different queries via Hive 3

Posted by "stayrascal (via GitHub)" <gi...@apache.org>.
stayrascal commented on issue #8038:
URL: https://github.com/apache/hudi/issues/8038#issuecomment-1443380149

   So, for running incremental queries via Hive on COW & MOR tables, we have to add a `_hoodie_commit_time` filter condition for MOR, but COW doesn't need it, right?
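   (For comparison, a sketch of the COW incremental form under the same session-property mechanism: properties per the Hudi Hive docs, timestamp taken from the COW show_commits output above.)
   
   ```
   -- Sketch: Hive incremental query on the COW table
   SET hoodie.flink_hudi_cow_tbl.consume.mode=INCREMENTAL;
   SET hoodie.flink_hudi_cow_tbl.consume.start.timestamp=20230216154001168;
   SET hoodie.flink_hudi_cow_tbl.consume.max.commits=2;
   SELECT `_hoodie_commit_time`, uuid, name
   FROM flink_hudi_cow_tbl
   WHERE `_hoodie_commit_time` > '20230216154001168';
   ```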




[GitHub] [hudi] codope commented on issue #8038: [SUPPORT] How to query different queries via Hive 3 & Trino

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #8038:
URL: https://github.com/apache/hudi/issues/8038#issuecomment-1455054456

   Currently, snapshot queries using Trino are only supported for COW tables. For MOR tables, you can only do read-optimized queries. Support matrix: https://hudi.apache.org/docs/querying_data#support-matrix
   We are working on adding MoR snapshot query support: https://github.com/trinodb/trino/pull/14786




[GitHub] [hudi] codope closed issue #8038: [SUPPORT] How to query different queries via Hive 3 & Trino

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope closed issue #8038: [SUPPORT] How to query different queries via Hive 3 & Trino
URL: https://github.com/apache/hudi/issues/8038

