Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/17 08:28:26 UTC
[GitHub] [hudi] ChangbingChen opened a new issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.
ChangbingChen opened a new issue #4618:
URL: https://github.com/apache/hudi/issues/4618
**Describe the problem you faced**
When querying a Hudi table in Hive, the results contain duplicated records.
The Hudi table is written by a Flink job.
**To Reproduce**
Steps to reproduce the behavior:
1. Submit a Flink job:
flink-sql-client -f mysql_table_sink.sql
The content of the SQL file:
create table `mysql_table_kafka` (
`id` bigint,
`create_time` timestamp,
`update_time` timestamp
)
with (
'connector' = 'kafka',
'topic' = 'mysql_table_kafka',
'properties.bootstrap.servers' = 'x.x.x.x:9092,x.x.x.x:9092,x.x.x.x:9092',
'properties.group.id' = 'mysql_table_cg',
'format' = 'canal-json',
'scan.startup.mode' = 'latest-offset'
);
[INFO] Execute statement succeed.
Flink SQL>
create table `mysql_table_sink_new` (
`id` bigint,
`create_time` bigint,
`update_time` bigint,
`dt` varchar(20)
) partitioned by (`dt`)
with(
'connector' = 'hudi'
,'path' = 'hdfs://nameservice1/hudi/mysql_table_sink_new'
,'hoodie.datasource.write.recordkey.field' = 'id'
,'write.precombine.field' = 'create_time'
,'write.tasks' = '1'
,'compaction.tasks' = '1'
,'read.streaming.enabled' = 'true'
,'read.streaming.check-interval' = '60'
,'table.type' = 'MERGE_ON_READ'
,'compaction.async.enabled' = 'true'
,'compaction.trigger.strategy' = 'num_commits'
,'compaction.delta_commits' = '2'
,'hive_sync.enable' = 'true'
,'hive_sync.mode' = 'hms'
,'hive_sync.metastore.uris' = 'thrift://x.x.x.x:9083'
,'hive_sync.jdbc_url' = 'jdbc:hive2://x.x.x.x:10000'
,'hive_sync.table' = 'mysql_table_sink_new'
,'hive_sync.db' = 'hv_ods'
);
insert into mysql_table_sink_new select
id
,unix_timestamp(date_format(create_time, 'yyyy-MM-dd HH:mm:ss'))*1000
,unix_timestamp(date_format(update_time, 'yyyy-MM-dd HH:mm:ss'))*1000
,date_format(create_time, 'yyyyMMdd') as dt from mysql_table_kafka;
2. Query in beeline:
select
m
,sum(case when cnt=1 then 1 else 0 end) as one_cnt
,sum(case when cnt=2 then 1 else 0 end) as two_cnt
,sum(case when cnt=3 then 1 else 0 end) as three_cnt
,sum(case when cnt=4 then 1 else 0 end) as four_cnt
from(
select id,from_unixtime(cast(create_time/1000 as bigint), 'yyyy-MM-dd HH:mm') as m,count(1) as cnt
from mysql_table_sink_new_ro
where dt='20220117'
group by id,from_unixtime(cast(create_time/1000 as bigint), 'yyyy-MM-dd HH:mm')
)t
group by m;
+-------------------+----------+----------+------------+-----------+--+
| m | one_cnt | two_cnt | three_cnt | four_cnt |
+-------------------+----------+----------+------------+-----------+--+
| 2022-01-17 16:07 | 0 | 0 | 0 | 5 |
| 2022-01-17 16:08 | 0 | 0 | 0 | 273 |
| 2022-01-17 16:09 | 0 | 0 | 37 | 241 |
| 2022-01-17 16:10 | 0 | 0 | 340 | 0 |
| 2022-01-17 16:11 | 0 | 21 | 239 | 0 |
| 2022-01-17 16:12 | 0 | 253 | 0 | 0 |
| 2022-01-17 16:13 | 38 | 261 | 0 | 0 |
| 2022-01-17 16:14 | 283 | 0 | 0 | 0 |
| 2022-01-17 16:15 | 247 | 0 | 0 | 0 |
+-------------------+----------+----------+------------+-----------+--+
select id,count(1) from mysql_table_sink_new_ro group by id having count(1)>1 limit 10;
+------------+------+--+
| id | _c1 |
+------------+------+--+
| 413588661 | 5 |
| 413588664 | 5 |
| 413588667 | 5 |
| 413588670 | 5 |
| 413588673 | 5 |
| 413588676 | 5 |
| 413588679 | 5 |
| 413588682 | 5 |
| 413588685 | 5 |
| 413588688 | 5 |
+------------+------+--+
select `_hoodie_commit_time`,`_hoodie_commit_seqno`,`_hoodie_record_key`,`_hoodie_partition_path`,`_hoodie_file_name`,id,create_time,update_time from mysql_table_sink_new_ro where id='413588661';
+----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | create_time | update_time |
+----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
| 20220117160954 | 20220117160954_0_482 | 413588661 | 20220117 | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661 | 1642406878000 | 1642406878000 |
| 20220117160954 | 20220117160954_0_482 | 413588661 | 20220117 | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661 | 1642406878000 | 1642406878000 |
| 20220117160954 | 20220117160954_0_482 | 413588661 | 20220117 | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661 | 1642406878000 | 1642406878000 |
| 20220117160954 | 20220117160954_0_482 | 413588661 | 20220117 | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661 | 1642406878000 | 1642406878000 |
| 20220117160954 | 20220117160954_0_482 | 413588661 | 20220117 | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661 | 1642406878000 | 1642406878000 |
+----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
**Environment Description**
* Hudi version : 0.10.0
* Spark version : xxx
* Hive version : 1.1.0-cdh5.13.3
* Hadoop version : 2.6.0-cdh5.13.3
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no
* Flink version : 1.13.3
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] xiarixiaoyao commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.
Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015196785
@ChangbingChen Sorry, I forgot one thing: before you query the Hudi table with Hive,
did you set the input format? For example: set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, or set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
Do you have WeChat? We can communicate directly through WeChat.
[GitHub] [hudi] xiarixiaoyao commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.
Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015031029
@ChangbingChen Do parquet files exist in your table? If they do, please set mapreduce.input.fileinputformat.split.maxsize >= (max size of the parquet files) to prevent Hive from splitting the parquet files.
[GitHub] [hudi] ChangbingChen commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.
Posted by GitBox <gi...@apache.org>.
ChangbingChen commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015160211
> @ChangbingChen I know Hudi has a bug related to this. If possible, could you please modify the Hudi code and package a new Hudi jar? The method to change is HoodieParquetRealtimeInputFormat.isSplitable:
>
> @Override
> protected boolean isSplitable(FileSystem fs, Path filename) {
>   if (filename instanceof PathWithLogFilePath) {
>     return ((PathWithLogFilePath) filename).splitable();
>   }
>   // return super.isSplitable(fs, filename);
>   return false;
> }
@xiarixiaoyao Sorry, it doesn't work either. I am querying the xxx_ro table; should the input format be org.apache.hudi.hadoop.HoodieParquetInputFormat?
By the way, there are four or five parquet files. Each compaction operation generates a new parquet file, and the oldest parquet file is deleted.
So does a Hive query scan all of those parquet files? Perhaps the newest one contains all records?
```
[yarn@x.x.x.x ~]$ hadoop fs -ls /hudi/mysql_table_sink_new/20220118
Found 9 items
-rw-r--r-- 3 yarn supergroup 22309728 2022-01-18 15:22 /hudi/mysql_table_sink_new/20220118/.77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152035.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 26237250 2022-01-18 15:24 /hudi/mysql_table_sink_new/20220118/.77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152235.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 25088875 2022-01-18 15:26 /hudi/mysql_table_sink_new/20220118/.77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152436.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 22962237 2022-01-18 15:28 /hudi/mysql_table_sink_new/20220118/.77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152636.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 93 2022-01-18 15:15 /hudi/mysql_table_sink_new/20220118/.hoodie_partition_metadata
-rw-r--r-- 3 yarn supergroup 8456473 2022-01-18 15:21 /hudi/mysql_table_sink_new/20220118/77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152035.parquet
-rw-r--r-- 3 yarn supergroup 10952244 2022-01-18 15:23 /hudi/mysql_table_sink_new/20220118/77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152235.parquet
-rw-r--r-- 3 yarn supergroup 13875797 2022-01-18 15:25 /hudi/mysql_table_sink_new/20220118/77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152436.parquet
-rw-r--r-- 3 yarn supergroup 16555809 2022-01-18 15:27 /hudi/mysql_table_sink_new/20220118/77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152636.parquet
```
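On the question of whether only the newest parquet file needs to be read: Hudi base files are named `<fileId>_<writeToken>_<instantTime>.parquet`, so the four parquet files in the listing above are successive versions of a single file group. A Hudi-aware input format is expected to read only the latest version of each file group, whereas a reader that treats the directory as plain parquet would read every version, which would be consistent with the duplicate counts reported in this issue. The following is a minimal sketch of that selection based on file names only. It is not Hudi's implementation, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hudi: picks the newest base file per file group
// by parsing names of the form <fileId>_<writeToken>_<instantTime>.parquet.
final class LatestBaseFiles {
    private static final String SUFFIX = ".parquet";

    // Splits a base file name into {fileId, writeToken, instantTime}, or returns null
    // for anything else (log files, partition metadata).
    private static String[] splitName(String name) {
        if (!name.endsWith(SUFFIX)) {
            return null;
        }
        String stem = name.substring(0, name.length() - SUFFIX.length());
        String[] parts = stem.split("_");
        return parts.length == 3 ? parts : null;
    }

    // Maps each file ID to the base file name with the greatest instant time.
    static Map<String, String> latestPerFileGroup(List<String> names) {
        Map<String, String> latest = new TreeMap<>();
        for (String name : names) {
            String[] parts = splitName(name);
            if (parts == null) {
                continue;
            }
            String current = latest.get(parts[0]);
            // Instant times are fixed-width digit strings, so string order is time order.
            if (current == null || splitName(current)[2].compareTo(parts[2]) < 0) {
                latest.put(parts[0], name);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // File names taken from the directory listing above.
        List<String> names = Arrays.asList(
            ".77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152636.log.1_0-1-0",
            ".hoodie_partition_metadata",
            "77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152035.parquet",
            "77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152235.parquet",
            "77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152436.parquet",
            "77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152636.parquet");
        latestPerFileGroup(names).values().forEach(System.out::println);
    }
}
```

For the listing above this selects only `77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152636.parquet`; the three older versions are skipped.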
[GitHub] [hudi] nsivabalan commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.
Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1020230951
CC @codope
[GitHub] [hudi] gubinjie commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.
Posted by GitBox <gi...@apache.org>.
gubinjie commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015010548
Hello, may I ask whether you have also run into the kind of problem described in #4600?
[GitHub] [hudi] ChangbingChen edited a comment on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.
Posted by GitBox <gi...@apache.org>.
ChangbingChen edited a comment on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015244832
> @ChangbingChen Sorry, I forgot one thing: before you query the Hudi table with Hive, did you set the input format? For example: set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, or set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
>
> Do you have WeChat? We can communicate directly through WeChat.
Great, thanks!
The query returns correct results when I set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat.
However, when I set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, the query throws an exception. It seems to be a Hive version compatibility problem.
The Hive version I use is 1.1.0-cdh5.13.3, which has no HiveInputFormat.pushProjectionsAndFilters method with the same parameter types, whereas Hive 2.3.1 does.
```
2022-01-18 17:14:04,796 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchMethodError: org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.pushProjectionsAndFilters(Lorg/apache/hadoop/mapred/JobConf;Ljava/lang/Class;Lorg/apache/hadoop/fs/Path;)V
at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:551)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:438)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
```
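The `NoSuchMethodError` above identifies the missing method by its JVM descriptor, which is harder to read than a Java signature. As a reading aid, the sketch below decodes such a descriptor into Java type names. It is a simplified, hypothetical helper (the class and method names are not from Hudi or Hive) that handles object types, arrays, and primitives.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reading aid, not part of Hudi or Hive.
final class DescriptorDecoder {

    // Decodes a JVM method descriptor, e.g. "([IJ)Z" becomes "(int[], long) -> boolean".
    static String decode(String descriptor) {
        int close = descriptor.indexOf(')');
        if (!descriptor.startsWith("(") || close < 0) {
            throw new IllegalArgumentException("not a method descriptor: " + descriptor);
        }
        int[] pos = {1}; // cursor shared with readType
        List<String> params = new ArrayList<>();
        while (pos[0] < close) {
            params.add(readType(descriptor, pos));
        }
        pos[0] = close + 1; // the return type follows the closing parenthesis
        String returnType = readType(descriptor, pos);
        return "(" + String.join(", ", params) + ") -> " + returnType;
    }

    // Reads one type starting at pos[0] and advances the cursor past it.
    private static String readType(String d, int[] pos) {
        StringBuilder arraySuffix = new StringBuilder();
        while (d.charAt(pos[0]) == '[') {
            arraySuffix.append("[]");
            pos[0]++;
        }
        char tag = d.charAt(pos[0]++);
        String base;
        switch (tag) {
            case 'L': {
                // Object type: Lfully/qualified/Name;
                int end = d.indexOf(';', pos[0]);
                base = d.substring(pos[0], end).replace('/', '.');
                pos[0] = end + 1;
                break;
            }
            case 'B': base = "byte"; break;
            case 'C': base = "char"; break;
            case 'D': base = "double"; break;
            case 'F': base = "float"; break;
            case 'I': base = "int"; break;
            case 'J': base = "long"; break;
            case 'S': base = "short"; break;
            case 'Z': base = "boolean"; break;
            case 'V': base = "void"; break;
            default: throw new IllegalArgumentException("unknown type tag: " + tag);
        }
        return base + arraySuffix;
    }

    public static void main(String[] args) {
        // Descriptor copied from the stack trace above.
        System.out.println(decode(
            "(Lorg/apache/hadoop/mapred/JobConf;Ljava/lang/Class;Lorg/apache/hadoop/fs/Path;)V"));
    }
}
```

For the descriptor in the stack trace this prints `(org.apache.hadoop.mapred.JobConf, java.lang.Class, org.apache.hadoop.fs.Path) -> void`, that is, a void method taking a `JobConf`, a `Class`, and a `Path`. This is the overload the reporter says is absent from Hive 1.1.0-cdh5.13.3.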
[GitHub] [hudi] nsivabalan commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.
Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1018522017
@xiarixiaoyao: I see you mentioned that there is a bug in Hudi around HoodieParquetRealtimeInputFormat.isSplitable. Do we have an open PR for this? If not, do you think you could put one up?
[GitHub] [hudi] ChangbingChen commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.
Posted by GitBox <gi...@apache.org>.
ChangbingChen commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015040351
> @ChangbingChen Do parquet files exist in your table? If they do, please set mapreduce.input.fileinputformat.split.maxsize >= (max size of the parquet files) to prevent Hive from splitting the parquet files.
Thanks for the reply. It doesn't work. The default value is already 256M.
```
hive> set mapreduce.input.fileinputformat.split.maxsize;
mapreduce.input.fileinputformat.split.maxsize=256000000
```
and the max size of the parquet files is less than 128M.
```
[yarn@x.x.x ~]$ hadoop fs -ls /hudi/mysql_table_sink_new/20220118
Found 10 items
-rw-r--r-- 3 yarn supergroup 7157103 2022-01-18 11:17 /hudi/mysql_table_sink_new/20220118/.82f164fd-f97d-4691-b9c6-21bea2769be0_20220118111603.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 7209495 2022-01-18 11:19 /hudi/mysql_table_sink_new/20220118/.82f164fd-f97d-4691-b9c6-21bea2769be0_20220118111759.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 10402799 2022-01-18 11:21 /hudi/mysql_table_sink_new/20220118/.82f164fd-f97d-4691-b9c6-21bea2769be0_20220118111959.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 7853954 2022-01-18 11:23 /hudi/mysql_table_sink_new/20220118/.82f164fd-f97d-4691-b9c6-21bea2769be0_20220118112159.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 4666049 2022-01-18 11:24 /hudi/mysql_table_sink_new/20220118/.82f164fd-f97d-4691-b9c6-21bea2769be0_20220118112359.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 93 2022-01-18 11:16 /hudi/mysql_table_sink_new/20220118/.hoodie_partition_metadata
-rw-r--r-- 3 yarn supergroup 1541035 2022-01-18 11:19 /hudi/mysql_table_sink_new/20220118/82f164fd-f97d-4691-b9c6-21bea2769be0_0-1-0_20220118111759.parquet
-rw-r--r-- 3 yarn supergroup 2741308 2022-01-18 11:21 /hudi/mysql_table_sink_new/20220118/82f164fd-f97d-4691-b9c6-21bea2769be0_0-1-0_20220118111959.parquet
-rw-r--r-- 3 yarn supergroup 4318101 2022-01-18 11:23 /hudi/mysql_table_sink_new/20220118/82f164fd-f97d-4691-b9c6-21bea2769be0_0-1-0_20220118112159.parquet
-rw-r--r-- 3 yarn supergroup 5585232 2022-01-18 11:25 /hudi/mysql_table_sink_new/20220118/82f164fd-f97d-4691-b9c6-21bea2769be0_0-1-0_20220118112359.parquet
```
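The condition from the quoted advice (maximum split size at least as large as the largest parquet file) can be checked against the numbers above. The sketch below does only that comparison. The class and method names are illustrative, and other factors that influence split computation, such as the HDFS block size, are deliberately left out.

```java
import java.util.Arrays;

// Illustrative check, not part of Hive or Hudi.
final class SplitSizeCheck {

    // True when the configured maximum split size is at least as large as every file.
    static boolean coversAllFiles(long maxSplitBytes, long[] fileSizes) {
        return Arrays.stream(fileSizes).max().orElse(0L) <= maxSplitBytes;
    }

    public static void main(String[] args) {
        // Value reported by `set mapreduce.input.fileinputformat.split.maxsize;` above.
        long maxSplitBytes = 256_000_000L;
        // Sizes in bytes of the four parquet files in the listing above.
        long[] parquetSizes = {1_541_035L, 2_741_308L, 4_318_101L, 5_585_232L};
        System.out.println(coversAllFiles(maxSplitBytes, parquetSizes));
    }
}
```

This prints `true`: the setting already exceeds the largest parquet file, which matches the report that changing it made no difference.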
[GitHub] [hudi] xiarixiaoyao commented on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.
Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015059696
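@ChangbingChen I know Hudi has a bug related to this.
If possible, could you please modify the Hudi code and package a new Hudi jar?
The method to change is HoodieParquetRealtimeInputFormat.isSplitable:
@Override
protected boolean isSplitable(FileSystem fs, Path filename) {
  // Paths that carry log file information decide for themselves whether they can be split.
  if (filename instanceof PathWithLogFilePath) {
    return ((PathWithLogFilePath) filename).splitable();
  }
  // Patched: all other files are never split; the original call is kept for reference.
  // return super.isSplitable(fs, filename);
  return false;
}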
[GitHub] [hudi] ChangbingChen removed a comment on issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.
Posted by GitBox <gi...@apache.org>.
ChangbingChen removed a comment on issue #4618:
URL: https://github.com/apache/hudi/issues/4618#issuecomment-1015144555
Sorry, it doesn't work either. I am querying the xxx_ro table; should the input format be org.apache.hudi.hadoop.HoodieParquetInputFormat?
By the way, there are four or five parquet files. Each compaction operation generates a new parquet file, and the oldest parquet file is deleted.
So does a Hive query scan all of those parquet files? Perhaps the newest one contains all records?
```
[yarn@x.x.x.x ~]$ hadoop fs -ls /hudi/mysql_table_sink_new/20220118
Found 9 items
-rw-r--r-- 3 yarn supergroup 22309728 2022-01-18 15:22 /hudi/mysql_table_sink_new/20220118/.77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152035.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 26237250 2022-01-18 15:24 /hudi/mysql_table_sink_new/20220118/.77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152235.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 25088875 2022-01-18 15:26 /hudi/mysql_table_sink_new/20220118/.77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152436.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 22962237 2022-01-18 15:28 /hudi/mysql_table_sink_new/20220118/.77dc5111-0ed0-400c-9df3-84b254650ab5_20220118152636.log.1_0-1-0
-rw-r--r-- 3 yarn supergroup 93 2022-01-18 15:15 /hudi/mysql_table_sink_new/20220118/.hoodie_partition_metadata
-rw-r--r-- 3 yarn supergroup 8456473 2022-01-18 15:21 /hudi/mysql_table_sink_new/20220118/77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152035.parquet
-rw-r--r-- 3 yarn supergroup 10952244 2022-01-18 15:23 /hudi/mysql_table_sink_new/20220118/77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152235.parquet
-rw-r--r-- 3 yarn supergroup 13875797 2022-01-18 15:25 /hudi/mysql_table_sink_new/20220118/77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152436.parquet
-rw-r--r-- 3 yarn supergroup 16555809 2022-01-18 15:27 /hudi/mysql_table_sink_new/20220118/77dc5111-0ed0-400c-9df3-84b254650ab5_0-1-0_20220118152636.parquet
```
[GitHub] [hudi] ChangbingChen closed issue #4618: [SUPPORT] When querying a hudi table in hive, there have duplicated records.
Posted by GitBox <gi...@apache.org>.
ChangbingChen closed issue #4618:
URL: https://github.com/apache/hudi/issues/4618