Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/17 08:28:26 UTC

[GitHub] [hudi] ChangbingChen opened a new issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.

ChangbingChen opened a new issue #4618:
URL: https://github.com/apache/hudi/issues/4618


   **Describe the problem you faced**
   
   When querying a Hudi table in Hive, the results contain duplicated records.
   
   This hudi table is produced by flink.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. submit a flink job
   flink-sql-client -f mysql_table_sink.sql
   
   the sql file content:
   create table `mysql_table_kafka` (
     `id` bigint,
     `create_time` timestamp,
     `update_time` timestamp
   )
   with (
   'connector' = 'kafka',
   'topic' = 'mysql_table_kafka',
   'properties.bootstrap.servers' = 'x.x.x.x:9092,x.x.x.x:9092,x.x.x.x:9092',
   'properties.group.id' = 'mysql_table_cg',
   'format' = 'canal-json',
   'scan.startup.mode' = 'latest-offset'
   );
   [INFO] Execute statement succeed.
   Flink SQL> 
   
   
   create table `mysql_table_sink_new` (
     `id` bigint,
     `create_time` bigint,
     `update_time` bigint,
     `dt` varchar(20)
   ) partitioned by (`dt`)
   with(
   'connector' = 'hudi'
   ,'path' = 'hdfs://nameservice1/hudi/mysql_table_sink_new'
   ,'hoodie.datasource.write.recordkey.field' = 'id'
   ,'write.precombine.field' = 'create_time'
   ,'write.tasks' = '1'
   ,'compaction.tasks' = '1'
   ,'read.streaming.enabled' = 'true'
   ,'read.streaming.check-interval' = '60'
   ,'table.type' = 'MERGE_ON_READ'
   ,'compaction.async.enabled' = 'true'
   ,'compaction.trigger.strategy' = 'num_commits'
   ,'compaction.delta_commits' = '2'
   ,'hive_sync.enable' = 'true'
   ,'hive_sync.mode' = 'hms'
   ,'hive_sync.metastore.uris' = 'thrift://x.x.x.x:9083'
   ,'hive_sync.jdbc_url' = 'jdbc:hive2://x.x.x.x:10000'
   ,'hive_sync.table' = 'mysql_table_sink_new'
   ,'hive_sync.db' = 'hv_ods'
   );
   
   
   insert into mysql_table_sink_new select
   id
   ,unix_timestamp(date_format(create_time, 'yyyy-MM-dd HH:mm:ss'))*1000
   ,unix_timestamp(date_format(update_time, 'yyyy-MM-dd HH:mm:ss'))*1000
   ,date_format(create_time, 'yyyyMMdd') as dt from mysql_table_kafka;
   
   2. Query in beeline
   
   select
   m
   ,sum(case when cnt=1 then 1 else 0 end) as one_cnt
   ,sum(case when cnt=2 then 1 else 0 end) as two_cnt
   ,sum(case when cnt=3 then 1 else 0 end) as three_cnt
   ,sum(case when cnt=4 then 1 else 0 end) as four_cnt
   from(
   select id,from_unixtime(cast(create_time/1000 as bigint), 'yyyy-MM-dd HH:mm') as m,count(1) as cnt
   from mysql_table_sink_new_ro
   where dt='20220117'
   group by id,from_unixtime(cast(create_time/1000 as bigint), 'yyyy-MM-dd HH:mm')
   )t
   group by m;
   +-------------------+----------+----------+------------+-----------+--+
   |         m         | one_cnt  | two_cnt  | three_cnt  | four_cnt  |
   +-------------------+----------+----------+------------+-----------+--+
   | 2022-01-17 16:07  | 0        | 0        | 0          | 5         |
   | 2022-01-17 16:08  | 0        | 0        | 0          | 273       |
   | 2022-01-17 16:09  | 0        | 0        | 37         | 241       |
   | 2022-01-17 16:10  | 0        | 0        | 340        | 0         |
   | 2022-01-17 16:11  | 0        | 21       | 239        | 0         |
   | 2022-01-17 16:12  | 0        | 253      | 0          | 0         |
   | 2022-01-17 16:13  | 38       | 261      | 0          | 0         |
   | 2022-01-17 16:14  | 283      | 0        | 0          | 0         |
   | 2022-01-17 16:15  | 247      | 0        | 0          | 0         |
   +-------------------+----------+----------+------------+-----------+--+
   
   
   select id,count(1) from mysql_table_sink_new_ro group by id having count(1)>1 limit 10;
   +------------+------+--+
   |     id     | _c1  |
   +------------+------+--+
   | 413588661  | 5    |
   | 413588664  | 5    |
   | 413588667  | 5    |
   | 413588670  | 5    |
   | 413588673  | 5    |
   | 413588676  | 5    |
   | 413588679  | 5    |
   | 413588682  | 5    |
   | 413588685  | 5    |
   | 413588688  | 5    |
   +------------+------+--+
   
   select `_hoodie_commit_time`,`_hoodie_commit_seqno`,`_hoodie_record_key`,`_hoodie_partition_path`,`_hoodie_file_name`,id,create_time,update_time from mysql_table_sink_new_ro where id='413588660';
   +----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
   | _hoodie_commit_time  | _hoodie_commit_seqno  | _hoodie_record_key  | _hoodie_partition_path  |                 _hoodie_file_name                  |     id     |  create_time   |  update_time   |
   +----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
   | 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
   | 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
   | 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
   | 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
   | 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
   +----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * Spark version : xxx
   
   * Hive version : 1.1.0-cdh5.13.3
   
   * Hadoop version : 2.6.0-cdh5.13.3
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   * Flink version : 1.13.3
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] xiarixiaoyao commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.

Posted by GitBox <gi...@apache.org>.

xiarixiaoyao commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015196785


   @ChangbingChen Sorry, I forgot one thing: before querying the Hudi table from Hive, did you set the input format? For example: set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, or set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
   
   Do you have WeChat? If so, we could communicate directly there.
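
The session setting suggested above can also be applied from a Java client. The following is a minimal sketch, assuming a Hive JDBC driver is on the classpath and that the JDBC URL is passed as the first argument; the class `HiveInputFormatSession` and its helper methods are hypothetical names introduced for illustration. The duplicate-check query and the table name `mysql_table_sink_new_ro` are taken from the issue description.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical client, not part of Hudi. Assumes a Hive JDBC driver is on the classpath.
final class HiveInputFormatSession {

    // Builds the session-level SET statement suggested in the comment above.
    static String setInputFormat(String inputFormatClass) {
        return "set hive.input.format=" + inputFormatClass;
    }

    // Duplicate check from the issue description, parameterized by table name.
    static String duplicateCheck(String table) {
        return "select id, count(1) from " + table
                + " group by id having count(1) > 1 limit 10";
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 1) {
            System.err.println("usage: HiveInputFormatSession <jdbc-url>");
            return;
        }
        try (Connection conn = DriverManager.getConnection(args[0]);
             Statement stmt = conn.createStatement()) {
            // The setting applies to this session only.
            stmt.execute(setInputFormat("org.apache.hadoop.hive.ql.io.HiveInputFormat"));
            try (ResultSet rs = stmt.executeQuery(duplicateCheck("mysql_table_sink_new_ro"))) {
                // Any row printed here is an id that still appears more than once.
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```

If the setting has the intended effect, the loop should print nothing. The same two statements can of course be typed directly into beeline.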


[GitHub] [hudi] xiarixiaoyao commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.

Posted by GitBox <gi...@apache.org>.

xiarixiaoyao commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015031029


   @ChangbingChen Do parquet files exist in your table? If they do, please set mapreduce.input.fileinputformat.split.maxsize to at least the size of the largest parquet file, to prevent Hive from splitting the parquet files.


[GitHub] [hudi] ChangbingChen commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.

Posted by GitBox <gi...@apache.org>.

ChangbingChen commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015160211


   > @ChangbingChen I know Hudi has a bug here. If possible, could you modify HoodieParquetRealtimeInputFormat.isSplitable in the Hudi code and package a new Hudi jar?
   > 
   > @Override protected boolean isSplitable(FileSystem fs, Path filename) { if (filename instanceof PathWithLogFilePath) { return ((PathWithLogFilePath)filename).splitable(); } /* return super.isSplitable(fs, filename); */ return false; }
   
   @xiarixiaoyao Sorry, that doesn't work either. I am querying the xxx_ro table; should the input format be org.apache.hudi.hadoop.HoodieParquetInputFormat?
   
   By the way, there are four or five parquet files. Each compaction operation generates a new parquet file, and the oldest parquet file is deleted.
   So does the Hive query scan all of those parquet files? Perhaps the newest one contains all the records?
   
   ```
   [yarn@x.x.x.x ~]$ hadoop fs -ls /hudi/mysql_table_sink_new/20220118
   Found 9 items
   -rw-r--r--   3 yarn supergroup   22309728 2022-01-18 15:22 /hudi/mysql_table_sink_new/20220118/.77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152035.log.1_0-1-0
   -rw-r--r--   3 yarn supergroup   26237250 2022-01-18 15:24 /hudi/mysql_table_sink_new/20220118/.77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152235.log.1_0-1-0
   -rw-r--r--   3 yarn supergroup   25088875 2022-01-18 15:26 /hudi/mysql_table_sink_new/20220118/.77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152436.log.1_0-1-0
   -rw-r--r--   3 yarn supergroup   22962237 2022-01-18 15:28 /hudi/mysql_table_sink_new/20220118/.77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152636.log.1_0-1-0
   -rw-r--r--   3 yarn supergroup         93 2022-01-18 15:15 /hudi/mysql_table_sink_new/20220118/.hoodie_partition_metadata
   -rw-r--r--   3 yarn supergroup    8456473 2022-01-18 15:21 /hudi/mysql_table_sink_new/20220118/77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152035.parquet
   -rw-r--r--   3 yarn supergroup   10952244 2022-01-18 15:23 /hudi/mysql_table_sink_new/20220118/77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152235.parquet
   -rw-r--r--   3 yarn supergroup   13875797 2022-01-18 15:25 /hudi/mysql_table_sink_new/20220118/77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152436.parquet
   -rw-r--r--   3 yarn supergroup   16555809 2022-01-18 15:27 /hudi/mysql_table_sink_new/20220118/77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152636.parquet
   ```
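
On the question of whether Hive scans every parquet file: the base file names in the listing above appear to follow the pattern `<fileId>_<writeToken>_<instantTime>.parquet`, and all four share one file ID, so they look like successive versions of the same file group. A reader that does not filter them would presumably return each record once per version, which would be consistent with the duplicate counts reported in the issue. The sketch below only illustrates the idea of keeping the newest base file per file ID. It is not Hudi's implementation, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hudi: keeps only the newest base file per file ID.
final class LatestBaseFileSketch {

    // Assumes names shaped like <fileId>_<writeToken>_<instantTime>.parquet,
    // as in the listing above. Log files and metadata files are skipped.
    static List<String> latestBaseFiles(List<String> fileNames) {
        Map<String, String> newestInstant = new TreeMap<>();
        Map<String, String> newestFile = new TreeMap<>();
        for (String name : fileNames) {
            if (name.startsWith(".") || !name.endsWith(".parquet")) {
                continue;
            }
            String stem = name.substring(0, name.length() - ".parquet".length());
            String[] parts = stem.split("_");
            if (parts.length != 3) {
                continue; // Not the assumed naming pattern.
            }
            String fileId = parts[0];
            String instant = parts[2];
            String current = newestInstant.get(fileId);
            // Instant times here are fixed-width digit strings, so string order matches time order.
            if (current == null || instant.compareTo(current) > 0) {
                newestInstant.put(fileId, instant);
                newestFile.put(fileId, name);
            }
        }
        return new ArrayList<>(newestFile.values());
    }

    public static void main(String[] args) {
        List<String> listing = List.of(
                ".77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152636.log.1_0-1-0",
                ".hoodie_partition_metadata",
                "77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152035.parquet",
                "77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152235.parquet",
                "77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152436.parquet",
                "77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152636.parquet");
        System.out.println(latestBaseFiles(listing));
    }
}
```

Applied to the listing above, only the file with instant time 20220118152636 survives the filter, while an unfiltered scan would read all four base files.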


[GitHub] [hudi] nsivabalan commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1020230951


   CC @codope 


[GitHub] [hudi] gubinjie commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.

Posted by GitBox <gi...@apache.org>.

gubinjie commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015010548


   Hello, may I ask whether you have encountered this kind of problem: #4600?
   


[GitHub] [hudi] ChangbingChen edited a comment on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.

Posted by GitBox <gi...@apache.org>.

ChangbingChen edited a comment on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015244832


   > @ChangbingChen Sorry, I forgot one thing: before querying the Hudi table from Hive, did you set the input format? For example: set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, or set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
   > 
   > Do you have WeChat? If so, we could communicate directly there.
   
   Great, thanks!
   
   It works when I set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat.
   However, with set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, the query throws an exception. It looks like a Hive version compatibility problem.
   The Hive version I use is 1.1.0-cdh5.13.3, which has no HiveInputFormat.pushProjectionsAndFilters method with these parameter types, whereas Hive 2.3.1 does.
   
   ```
   2022-01-18 17:14:04,796 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchMethodError: org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.pushProjectionsAndFilters(Lorg/apache/hadoop/mapred/JobConf;Ljava/lang/Class;Lorg/apache/hadoop/fs/Path;)V
   	at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:551)
   	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
   	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:438)
   	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
   	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
   	at java.security.AccessController.doPrivileged(Native Method)
   	at javax.security.auth.Subject.doAs(Subject.java:422)
   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
   	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
   ```
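
A `NoSuchMethodError` like the one above typically means the class found at runtime lacks a method that was present at compile time. One way to check this on a given classpath is a small reflection probe. The sketch below is a hypothetical diagnostic helper, not part of Hudi or Hive. It uses JDK classes as stand-ins so that it runs without Hive on the classpath; on a cluster one would instead pass the `HiveInputFormat` class together with `JobConf.class`, `Class.class`, and `Path.class`, matching the descriptor in the stack trace. Note that it compares only the name and parameter types, not the return type that the JVM descriptor also includes.

```java
import java.lang.reflect.Method;

// Hypothetical diagnostic helper, not part of Hudi or Hive.
final class MethodProbe {

    // Returns true if clazz or one of its superclasses declares a method with
    // exactly this name and parameter list, regardless of visibility.
    static boolean declaresMethod(Class<?> clazz, String name, Class<?>... paramTypes) {
        for (Class<?> c = clazz; c != null; c = c.getSuperclass()) {
            try {
                Method m = c.getDeclaredMethod(name, paramTypes);
                return m != null;
            } catch (NoSuchMethodException e) {
                // Not declared at this level; keep walking up the hierarchy.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // ArrayList.removeRange(int, int) is a protected method, so this signature is found.
        System.out.println(declaresMethod(java.util.ArrayList.class, "removeRange", int.class, int.class));
        // Same name with different parameter types is not found.
        System.out.println(declaresMethod(java.util.ArrayList.class, "removeRange", long.class, long.class));
    }
}
```

The probe walks the superclasses with `getDeclaredMethod` rather than calling `getMethod`, because `getMethod` only sees public methods and would miss a protected one.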
   


[GitHub] [hudi] nsivabalan commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1018522017


   @xiarixiaoyao : I see you mentioned that there is a bug in Hudi around HoodieParquetRealtimeInputFormat.isSplitable. Do we have an open PR for this? If not, do you think you could put one up?


[GitHub] [hudi] ChangbingChen commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.

Posted by GitBox <gi...@apache.org>.

ChangbingChen commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015040351


   > @ChangbingChen Do parquet files exist in your table? If they do, please set mapreduce.input.fileinputformat.split.maxsize to at least the size of the largest parquet file, to prevent Hive from splitting the parquet files.
   
   Thanks for the reply. It doesn't work. The default value is 256M.
   ```
   hive> set mapreduce.input.fileinputformat.split.maxsize;
   mapreduce.input.fileinputformat.split.maxsize=256000000
   ```
   
   and the largest parquet file is smaller than 128M.
   ```
   [yarn@x.x.x ~]$ hadoop fs -ls /hudi/mysql_table_sink_new/20220118
   Found 10 items
   -rw-r--r--   3 yarn supergroup    7157103 2022-01-18 11:17 /hudi/mysql_table_sink_new/20220118/.82f164fd-f97d-4691-b9c6-21bea2769be0_20220118111603.log.1_0-1-0
   -rw-r--r--   3 yarn supergroup    7209495 2022-01-18 11:19 /hudi/mysql_table_sink_new/20220118/.82f164fd-f97d-4691-b9c6-21bea2769be0_20220118111759.log.1_0-1-0
   -rw-r--r--   3 yarn supergroup   10402799 2022-01-18 11:21 /hudi/mysql_table_sink_new/20220118/.82f164fd-f97d-4691-b9c6-21bea2769be0_20220118111959.log.1_0-1-0
   -rw-r--r--   3 yarn supergroup    7853954 2022-01-18 11:23 /hudi/mysql_table_sink_new/20220118/.82f164fd-f97d-4691-b9c6-21bea2769be0_20220118112159.log.1_0-1-0
   -rw-r--r--   3 yarn supergroup    4666049 2022-01-18 11:24 /hudi/mysql_table_sink_new/20220118/.82f164fd-f97d-4691-b9c6-21bea2769be0_20220118112359.log.1_0-1-0
   -rw-r--r--   3 yarn supergroup         93 2022-01-18 11:16 /hudi/mysql_table_sink_new/20220118/.hoodie_partition_metadata
   -rw-r--r--   3 yarn supergroup    1541035 2022-01-18 11:19 /hudi/mysql_table_sink_new/20220118/82f164fd-f97d-4691-b9c6-21bea2769be0_0-1-0_20220118111759.parquet
   -rw-r--r--   3 yarn supergroup    2741308 2022-01-18 11:21 /hudi/mysql_table_sink_new/20220118/82f164fd-f97d-4691-b9c6-21bea2769be0_0-1-0_20220118111959.parquet
   -rw-r--r--   3 yarn supergroup    4318101 2022-01-18 11:23 /hudi/mysql_table_sink_new/20220118/82f164fd-f97d-4691-b9c6-21bea2769be0_0-1-0_20220118112159.parquet
   -rw-r--r--   3 yarn supergroup    5585232 2022-01-18 11:25 /hudi/mysql_table_sink_new/20220118/82f164fd-f97d-4691-b9c6-21bea2769be0_0-1-0_20220118112359.parquet
   ```
   


[GitHub] [hudi] xiarixiaoyao commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.

Posted by GitBox <gi...@apache.org>.

xiarixiaoyao commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015059696


   @ChangbingChen I know Hudi has a bug here.
   If possible, could you modify the Hudi code as follows and package a new Hudi jar?
   HoodieParquetRealtimeInputFormat.isSplitable:
   
     @Override
     protected boolean isSplitable(FileSystem fs, Path filename) {
       if (filename instanceof PathWithLogFilePath) {
         return ((PathWithLogFilePath)filename).splitable();
       }
       // return super.isSplitable(fs, filename);
       return false;
     }


[GitHub] [hudi] ChangbingChen removed a comment on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.

Posted by GitBox <gi...@apache.org>.

ChangbingChen removed a comment on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015144555


   
   
   
   
[GitHub] [hudi] ChangbingChen closed issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.

Posted by GitBox <gi...@apache.org>.

ChangbingChen closed issue #4618:
URL: https://github.com/apache/hudi/issues/4618


   

