Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/15 08:10:00 UTC

[GitHub] [hudi] gubinjie commented on issue #4600: [SUPPORT]When hive queries Hudi data, the query path is wrong

gubinjie commented on issue #4600:
URL: https://github.com/apache/hudi/issues/4600#issuecomment-1013640773


   Version list:
   Flink: 1.13.3
    CDH: 6.3.2
   Hive: 2.1.1-cdh6.3.2 rb3393cf499504df1d2a12d34b4285e5d0c02be11
   Hudi: 0.10.0-rc2
    The jars I compiled myself are:
   **hudi-hadoop-mr-bundle-0.10.0-rc2.jar
   hudi-hive-sync-bundle-0.10.0-rc2.jar**
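
    (Side note, not from the original report: the hudi-hadoop-mr bundle has to be on Hive's classpath before Hive can read the synced tables. A minimal sketch of registering it for a single Hive session is below; the HDFS jar path is a hypothetical placeholder, and the cluster-wide alternative is dropping the jar into Hive's auxlib / hive.aux.jars.path.)
    -- hypothetical jar location, adjust to wherever the bundle was actually uploaded
    ADD JAR hdfs://paat-dev/user/hudi/jars/hudi-hadoop-mr-bundle-0.10.0-rc2.jar;
    LIST JARS;  -- verify the bundle is visible to the session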
   
   Flink startup method:
   **./yarn-session.sh -s 2 -jm 4048 -tm 4048 -nm ys-hudi01 -d**
   Flink Sql-Client startup method:
   **./sql-client.sh embedded -j ../lib/hudi-flink-bundle_2.12-0.10.0-rc2.jar**
   
   **Create Flink table script:**
   CREATE TABLE guhudi(
    id bigint ,
    name string,
    birthday TIMESTAMP(3),
     ts TIMESTAMP(3),
    `partition` VARCHAR(20),
    primary key(id) not enforced -- the uuid primary key must be specified
   )
   PARTITIONED BY (`partition`)
   with(
    'connector'='hudi',
    **'path' = 'hdfs://paat-dev/user/hudi/guhudi'**
   , 'hoodie.datasource.write.recordkey.field' = 'id'
   , 'write.precombine.field' = 'ts'
   , 'write.tasks' = '1'
   , 'compaction.tasks' = '1'
   , 'write.rate.limit' = '2000'
   , 'table.type' = 'MERGE_ON_READ'
   , 'compaction.async.enable' = 'true'
   , 'compaction.trigger.strategy' = 'num_commits'
   , 'compaction.delta_commits' = '5'
   , 'changelog.enable' = 'true'
   , 'read.streaming.enable' = 'true'
   , 'read.streaming.check-interval' = '4'
   , 'hive_sync.enable' = 'true'
   , 'hive_sync.mode'= 'hms'
   , 'hive_sync.metastore.uris' = 'thrift://****:9083'
   , 'hive_sync.jdbc_url' = 'jdbc:hive2://****:10000'
   , 'hive_sync.table' = 'guhudi'
   , 'hive_sync.db' = 'hudi'
   , 'hive_sync.username' = '****'
   , 'hive_sync.password' = '*****'
   , 'hive_sync.support_timestamp' = 'true'
   );
   Insert data:
   insert into guhudi select 15,'test2',TIMESTAMP '1970-01-01 00:00:01',TIMESTAMP '1970-01-01 00:00:01','part1';
   insert into guhudi select 16,'test1',TIMESTAMP '1970-01-01 00:00:01',TIMESTAMP '1970-01-01 00:00:01','part1';
   insert into guhudi select 17,'test1',TIMESTAMP '1970-01-01 00:00:01',TIMESTAMP '1970-01-01 00:00:01','part1';
   insert into guhudi select 18,'test1',TIMESTAMP '1970-01-01 00:00:01',TIMESTAMP '1970-01-01 00:00:01','part2';
   insert into guhudi select 19,'test1',TIMESTAMP '1970-01-01 00:00:01',TIMESTAMP '1970-01-01 00:00:01','part2';
   insert into guhudi select 20,'test1',TIMESTAMP '1970-01-01 00:00:01',TIMESTAMP '1970-01-01 00:00:01','part2';
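
    (For illustration only, not executed in this report: because `id` is the record key, re-inserting an existing key should behave as an upsert, updating the row rather than adding a duplicate; with 'changelog.enable' = 'true' a streaming read would emit the change as -U/+U rows.)
    -- hypothetical upsert: id = 15 already exists, so this updates the row in place
    insert into guhudi select 15,'test2-updated',TIMESTAMP '1970-01-01 00:00:01',TIMESTAMP '1970-01-01 00:00:02','part1';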
    **Additional configuration executed in the SQL client:
   set execution.checkpointing.interval=30sec;
   set sql-client.execution.result-mode=tableau;**
   
   =================================================
   Flink SQL> select * from guhudi;
   2022-01-15 15:59:56,863 INFO org.apache.hadoop.yarn.client.RMProxy[] - Connecting to ResourceManager at lo-t-bd-nn/172.16.7.55:8032
   2022-01-15 15:59:56,864 INFO org.apache.flink.yarn.YarnClusterDescriptor[] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
   2022-01-15 15:59:56,864 WARN org.apache.flink.yarn.YarnClusterDescriptor[] - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment variable is set. The Flink YARN Client needs one of these to be set to properly load the Hadoop configuration for accessing YARN.
   2022-01-15 15:59:56,867 INFO org.apache.flink.yarn.YarnClusterDescriptor[] - Found Web Interface lo-t-work1:38477 of application 'application_1642128573447_0031'.
    +----+----+-------+-------------------------+-------------------------+-----------+
    | op | id | name  | birthday                | ts                      | partition |
    +----+----+-------+-------------------------+-------------------------+-----------+
    | +I | 18 | test1 | 1970-01-01 00:00:01.000 | 1970-01-01 00:00:01.000 | part2     |
    | +I | 19 | test1 | 1970-01-01 00:00:01.000 | 1970-01-01 00:00:01.000 | part2     |
    | +I | 20 | test1 | 1970-01-01 00:00:01.000 | 1970-01-01 00:00:01.000 | part2     |
    | +I | 15 | test2 | 1970-01-01 00:00:01.000 | 1970-01-01 00:00:01.000 | part1     |
    | +I | 16 | test1 | 1970-01-01 00:00:01.000 | 1970-01-01 00:00:01.000 | part1     |
    | +I | 17 | test1 | 1970-01-01 00:00:01.000 | 1970-01-01 00:00:01.000 | part1     |
    +----+----+-------+-------------------------+-------------------------+-----------+
   Received a total of 6 rows
   
    ================= Flink SQL query ends ===================
    **After the above execution completes, the _ro/_rt tables are automatically created in Hive:**
   hive> show tables;
   OK
   guhudi_ro
   guhudi_rt
   Time taken: 0.101 seconds, Fetched: 2 row(s)
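
    (For reference, a minimal sketch of how the synced tables would normally be queried from Hive; none of this was run in the report, and the hive.input.format line reflects what the Hudi docs suggest for MERGE_ON_READ tables, so adjust it if your setup differs.)
    use hudi;
    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;  -- commonly suggested for Hudi MOR tables
    select `id`, `name`, `ts`, `partition` from guhudi_rt;  -- real-time view, merges log files at query time
    select `id`, `name`, `ts`, `partition` from guhudi_ro;  -- read-optimized view, base files only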
   ====================================================
   
   
    The structure of the _ro/_rt tables above is:
   hive> show create table guhudi_ro;
   OK
   CREATE EXTERNAL TABLE `guhudi_ro`(
     `_hoodie_commit_time` string COMMENT '', 
     `_hoodie_commit_seqno` string COMMENT '', 
     `_hoodie_record_key` string COMMENT '', 
     `_hoodie_partition_path` string COMMENT '', 
     `_hoodie_file_name` string COMMENT '', 
     `id` bigint COMMENT '', 
     `name` string COMMENT '', 
     `birthday` bigint COMMENT '', 
     `ts` bigint COMMENT '')
   PARTITIONED BY ( 
     `partition` string COMMENT '')
   ROW FORMAT SERDE 
     'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
   WITH SERDEPROPERTIES ( 
     'hoodie.query.as.ro.table'='true', 
     **'path'='hdfs://paat-dev/user/hudi/guhudi')** 
   STORED AS INPUTFORMAT 
     'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
   OUTPUTFORMAT 
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
     **'hdfs://paat-dev/user/hudi/guhudi'**
   TBLPROPERTIES (
     'last_commit_time_sync'='20220115155947483', 
     'spark.sql.sources.provider'='hudi', 
     'spark.sql.sources.schema.numPartCols'='1', 
     'spark.sql.sources.schema.numParts'='1', 
     'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"birthday","type":"timestamp","nullable":true,"metadata":{}},{"name":"ts","type":"timestamp","nullable":true,"metadata":{}},{"name":"partition","type":"string","nullable":true,"metadata":{}}]}', 
     'spark.sql.sources.schema.partCol.0'='partition', 
     'transient_lastDdlTime'='1642149108')
   Time taken: 0.088 seconds, Fetched: 31 row(s)
   
   =================================================
   hive> show create table guhudi_rt;
   OK
   CREATE EXTERNAL TABLE `guhudi_rt`(
     `_hoodie_commit_time` string COMMENT '', 
     `_hoodie_commit_seqno` string COMMENT '', 
     `_hoodie_record_key` string COMMENT '', 
     `_hoodie_partition_path` string COMMENT '', 
     `_hoodie_file_name` string COMMENT '', 
     `id` bigint COMMENT '', 
     `name` string COMMENT '', 
     `birthday` bigint COMMENT '', 
     `ts` bigint COMMENT '')
   PARTITIONED BY ( 
     `partition` string COMMENT '')
   ROW FORMAT SERDE 
     'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
   WITH SERDEPROPERTIES ( 
     'hoodie.query.as.ro.table'='false', 
     'path'='hdfs://paat-dev/user/hudi/guhudi') 
   STORED AS INPUTFORMAT 
     'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat' 
   OUTPUTFORMAT 
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
     **'hdfs://paat-dev/user/hudi/guhudi'**
   TBLPROPERTIES (
     'last_commit_time_sync'='20220115155947483', 
     'spark.sql.sources.provider'='hudi', 
     'spark.sql.sources.schema.numPartCols'='1', 
     'spark.sql.sources.schema.numParts'='1', 
     'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"birthday","type":"timestamp","nullable":true,"metadata":{}},{"name":"ts","type":"timestamp","nullable":true,"metadata":{}},{"name":"partition","type":"string","nullable":true,"metadata":{}}]}', 
     'spark.sql.sources.schema.partCol.0'='partition', 
     'transient_lastDdlTime'='1642149108')
   Time taken: 0.037 seconds, Fetched: 31 row(s)
   ============================================================
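
    (The DDL above only shows the table-level LOCATION. A hypothetical follow-up check, not run in this report, is to list the partitions that Hive sync registered and the HDFS location each one resolves to, and compare them against the Flink 'path' setting:)
    show partitions guhudi_rt;
    describe formatted guhudi_rt partition (`partition`='part1');  -- shows the partition-level Location that Hive will scan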
   
    Check whether the Hudi table's data exists in HDFS:
   [root@lo-t-work1 ~]# hadoop fs -ls /user/hudi/guhudi
   Found 3 items
   **drwxr-xr-x - root supergroup 0 2022-01-15 15:59 /user/hudi/guhudi/.hoodie
   drwxr-xr-x - root supergroup 0 2022-01-15 15:58 /user/hudi/guhudi/part1
   drwxr-xr-x - root supergroup 0 2022-01-15 15:59 /user/hudi/guhudi/part2**
   [root@lo-t-work1 ~]# hadoop fs -ls /user/hudi/guhudi/part1
   Found 2 items
   -rw-r--r-- 3 root supergroup 3948 2022-01-15 15:59 /user/hudi/guhudi/part1/.6dcddda7-2a1b-44b5-a794-3f9058d005e7_20220115155847944.log.1_0-1-0
   -rw-r--r-- 3 root supergroup 96 2022-01-15 15:58 /user/hudi/guhudi/part1/.hoodie_partition_metadata
   [root@lo-t-work1 ~]# hadoop fs -ls /user/hudi/guhudi/part2
   Found 2 items
   -rw-r--r-- 3 root supergroup 2961 2022-01-15 15:59 /user/hudi/guhudi/part2/.f4ba4d90-5035-455e-ae99-4932dde39b77_20220115155937926.log.1_0-1-0
   -rw-r--r-- 3 root supergroup 96 2022-01-15 15:59 /user/hudi/guhudi/part2/.hoodie_partition_metadata
    The check shows that Hudi data files have been generated.
   ===============================================
    **But the data cannot be queried from Hive; Hive looks for it under other paths, as shown in the question posted at the top of this issue.**

