Posted to commits@hudi.apache.org by "MihawkZoro (via GitHub)" <gi...@apache.org> on 2023/02/03 08:31:37 UTC
[GitHub] [hudi] MihawkZoro opened a new issue, #7839: [BUG] the deleted data reappeared after clustering on the table
MihawkZoro opened a new issue, #7839:
URL: https://github.com/apache/hudi/issues/7839
**Environment Description**
* Hudi version : 0.12.2
* Spark version : 3.2.2
* Hadoop version : 2.7.3
* Storage : hdfs
**Describe the problem you faced**
I have a Hudi table from which I deleted some records. After I then ran clustering on it, the deleted records reappeared when I queried the table.
**To Reproduce**
1. I have a Hudi table called cluster_test and delete some records from it:
```
delete from cluster_test where id in (2,8,11);
```
The result after the delete is:
<img width="877" alt="WeCom screenshot e767a9a0-741c-4d83-b25b-bd1c747bf68a" src="https://user-images.githubusercontent.com/32875366/216547022-cda0100d-0d17-4a79-83c5-c1558cfac593.png">
2. Then I submit a clustering job:
```
spark-submit --class org.apache.hudi.utilities.HoodieClusteringJob hudi-utilities-bundle_2.12-0.12.2.jar \
--props file:///Users/qishuiqing/develop/hudi/clusteringjob.properties \
--mode scheduleAndExecute --base-path 'hdfs://localhost:9000/user/hive/warehouse/hudi.db/cluster_test' \
--table-name cluster_test --parallelism 4 \
--spark-memory 4g
```
The result after clustering is:
<img width="1131" alt="WeCom screenshot f9e34400-113c-43e9-9e26-1d3b095b7752" src="https://user-images.githubusercontent.com/32875366/216547899-c7b30c10-93a0-4810-b4f4-518a552feb8c.png">
3. Table structure:
```
col_name                data_type  comment
_hoodie_commit_time     string
_hoodie_commit_seqno    string
_hoodie_record_key      string
_hoodie_partition_path  string
_hoodie_file_name       string
id                      int
name                    string
ts                      bigint

# Detailed Table Information
Database                hudi
Table                   cluster_test
Created By              Spark 3.2.2
Type                    EXTERNAL
Provider                hudi
Table Properties        [preCombineField=ts, primaryKey=id, type=mor]
Statistics              2173911 bytes
Location                hdfs://localhost:9000/user/hive/warehouse/hudi.db/cluster_test
Serde Library           org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat             org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat
OutputFormat            org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```
**Conclusion**
This is a serious bug that needs to be fixed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] nsivabalan commented on issue #7839: [BUG] the deleted data reappeared after clustering on the table
Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1420029313
Hey @MihawkZoro:
Was the delete command issued from spark-sql?
Can you post the contents of the ".hoodie" directory?
Can you post all the write configs used?
[GitHub] [hudi] MihawkZoro commented on issue #7839: [BUG] the deleted data reappeared after clustering on the table
Posted by "MihawkZoro (via GitHub)" <gi...@apache.org>.
MihawkZoro commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1420123763
> hey @MihawkZoro : is the delete command issued from spark-sql? can you post the contents of ".hoodie" directory. can you post all write configs used.
@nsivabalan Hi, here are the configs used:
```
hoodie.table.precombine.field=ts
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.partition.fields=
hoodie.table.type=MERGE_ON_READ
hoodie.archivelog.folder=archived
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
hoodie.timeline.layout.version=1
hoodie.table.version=5
hoodie.table.metadata.partitions=files
hoodie.table.recordkey.fields=id
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.database.name=hudi
hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.table.name=cluster_test
hoodie.datasource.write.hive_style_partitioning=true
hoodie.table.checksum=1916678829
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"id","type"\:["int","null"]},{"name"\:"name","type"\:["string","null"]},{"name"\:"ts","type"\:["long","null"]}]}
```
Here is the timeline data from before the clustering:
![image](https://user-images.githubusercontent.com/32875366/217136536-9207a877-7fa2-47a8-8448-bef639bcc1e5.png)
[GitHub] [hudi] MihawkZoro commented on issue #7839: [BUG] the deleted data reappeared after clustering on the table
Posted by "MihawkZoro (via GitHub)" <gi...@apache.org>.
MihawkZoro commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1420111468
> Hey @MihawkZoro : I could not reproduce on my end. Here is the steps I followed.
>
> 1. Created a table via spark-sql
>
> ```
> create table parquet_tbl1 using parquet location 'file:///tmp/tbl1/*.parquet';
> drop table hudi_ctas_cow1;
> create table hudi_ctas_cow1 using hudi location 'file:/tmp/hudi/hudi_tbl/' options (
> type = 'cow',
> primaryKey = 'tpep_pickup_datetime',
> preCombineField = 'tpep_dropoff_datetime'
> )
> partitioned by (date_col) as select * from parquet_tbl1;
> ```
>
> 2. Read data from one of the partition w/ "VendorId = 1".
>
> ```
> select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10' group by 1;
> ```
>
> this returned 1, 1914 2, 3988
>
> 3. Issue deletes to records w/ VendorId = 1 for this specific partition.
>
> ```
> delete from hudi_ctas_cow1 where date_col = '2019-08-10' and VendorID = 1;
> ```
>
> Verified from ".hoodie", that a new commit has succeeded and it added one new parquet file to 2019-08-10 partition.
>
> ```
> ls -ltr /tmp/hudi/hudi_tbl/date_col=2019-08-10/
> total 2192
> -rw-r--r-- 1 nsb wheel 571011 Feb 6 17:19 f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_10-27-119_20230206171846307.parquet
> -rw-r--r-- 1 nsb wheel 529348 Feb 6 17:24 f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_0-83-1538_20230206172355871.parquet
> ```
>
> the 2nd parquet file was written due to the delete operation.
>
> 4. Triggered clustering job.
> Property file contents
>
> ```
> cat /tmp/cluster.props
>
> hoodie.datasource.write.recordkey.field=tpep_pickup_datetime
> hoodie.datasource.write.partitionpath.field=date_col
> hoodie.datasource.write.precombine.field=tpep_dropoff_datetime
>
> hoodie.upsert.shuffle.parallelism=8
> hoodie.insert.shuffle.parallelism=8
> hoodie.delete.shuffle.parallelism=8
> hoodie.bulkinsert.shuffle.parallelism=8
>
> hoodie.clustering.plan.strategy.sort.columns=date_col,tpep_pickup_datetime
> hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
>
> hoodie.parquet.small.file.limit=0
> hoodie.clustering.inline=true
> hoodie.clustering.inline.max.commits=1
> hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
> hoodie.clustering.plan.strategy.small.file.limit=629145600
> hoodie.clustering.async.enabled=true
> hoodie.clustering.async.max.commits=1
> ```
>
> ```
> ./bin/spark-submit --class org.apache.hudi.utilities.HoodieClusteringJob ~/Downloads/hudi-utilities-bundle_2.11-0.12.2.jar --props /tmp/cluster.props --mode scheduleAndExecute --base-path /tmp/hudi/hudi_tbl/ --table-name hudi_ctas_cow1 --spark-memory 4g
> ```
>
> Verified from ".hoodie" that I could see replace commit and it has succeeded.
>
> 5. re-launched spark-sql and queried the table.
>
> ```
> refresh table hudi_ctas_cow1;
> select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10' group by 1;
> ```
>
> output
>
> ```
> 2 3988
> Time taken: 3.818 seconds, Fetched 1 row(s)
> ```
My table is MOR (merge-on-read), whereas the steps above use a COW table:
<img width="623" alt="image" src="https://user-images.githubusercontent.com/32875366/217133399-ba18e8be-4b75-4983-9a0d-58787906b222.png">
[GitHub] [hudi] nsivabalan commented on issue #7839: [BUG] the deleted data reappeared after clustering on the table
Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1473107897
Hmm, interesting. I could not reproduce this. Can you give me a script that reproduces it?
https://gist.github.com/nsivabalan/17125d03e56fc5e4d72381536a8ea5ae
I tried joining the snapshot read with the dataframe of records that got deleted and did not find any matches.
But my clustering configs are very simple; I am not sure whether that plays a part.
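The check described above (deleted keys joined against a snapshot read) can be sketched in plain Python. This is a minimal illustration, not the code from the linked gist: the helper name `reappeared_keys` and the toy rows are made up for this example, and in practice the rows would come from a Spark snapshot read of the table. Only the key column `id` and the deleted ids `2, 8, 11` are taken from the issue description.

```python
def reappeared_keys(snapshot_rows, deleted_keys, key_field="id"):
    """Return the deleted keys that are still present in a snapshot read.

    snapshot_rows: iterable of dict-like rows (hypothetical stand-in for a
                   Spark snapshot query result).
    deleted_keys:  iterable of record keys that were deleted earlier.
    """
    deleted = set(deleted_keys)
    present = {row[key_field] for row in snapshot_rows}
    # An empty result means no deleted record is visible any more.
    return sorted(deleted & present)


# Toy rows standing in for a snapshot read after clustering (made-up data).
snapshot = [
    {"id": 1, "name": "a", "ts": 100},
    {"id": 3, "name": "c", "ts": 100},
    {"id": 8, "name": "h", "ts": 100},  # a deleted key that came back
]

print(reappeared_keys(snapshot, [2, 8, 11]))
```

An empty list corresponds to the "no matches" outcome reported here; any non-empty list would name the keys that resurfaced.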
[GitHub] [hudi] MihawkZoro commented on issue #7839: [BUG] the deleted data reappeared after clustering on the table
Posted by "MihawkZoro (via GitHub)" <gi...@apache.org>.
MihawkZoro commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1420489417
Clustering job property file contents:
```
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=0
hoodie.clustering.plan.strategy.target.file.max.bytes=805306368
hoodie.clustering.plan.strategy.small.file.limit=268435456
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.sort.columns=ts
hoodie.clustering.plan.strategy.max.num.groups=100
```
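Since two different clustering property files appear in this thread, it may help to list exactly where they differ. The following is a small sketch using only the Python standard library; the helper names `parse_properties` and `diff_properties` are invented for this example, the first property set is the one posted above, and the second contains only the `hoodie.clustering.*` keys from the `/tmp/cluster.props` file used in the reproduction attempt in this thread.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def diff_properties(left, right, prefix=""):
    """Map each differing key to (left value, right value); None means unset."""
    keys = sorted(k for k in set(left) | set(right) if k.startswith(prefix))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


REPORTER = """
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=0
hoodie.clustering.plan.strategy.target.file.max.bytes=805306368
hoodie.clustering.plan.strategy.small.file.limit=268435456
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.sort.columns=ts
hoodie.clustering.plan.strategy.max.num.groups=100
"""

MAINTAINER = """
hoodie.clustering.plan.strategy.sort.columns=date_col,tpep_pickup_datetime
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=1
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=1
"""

differences = diff_properties(
    parse_properties(REPORTER), parse_properties(MAINTAINER), "hoodie.clustering."
)
for key, (reporter_value, maintainer_value) in differences.items():
    print(f"{key}: reporter={reporter_value} maintainer={maintainer_value}")
```

The sketch only reports differences; whether any of them is related to the reappearing records is not established in this thread.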
[GitHub] [hudi] codope commented on issue #7839: [BUG] the deleted data reappeared after clustering on the table
Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1522091991
Downgrading the priority as the issue is not reproducible. Please provide more info as requested above.
[GitHub] [hudi] nsivabalan commented on issue #7839: [BUG] the deleted data reappeared after clustering on the table
Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1474193835
@MihawkZoro: Did you try inline clustering? Did that also result in the deleted records reappearing?
[GitHub] [hudi] nsivabalan commented on issue #7839: [BUG] the deleted data reappeared after clustering on the table
Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1420045535
Hey @MihawkZoro: I could not reproduce this on my end. Here are the steps I followed.
1. Created a table via spark-sql
```
create table parquet_tbl1 using parquet location 'file:///tmp/tbl1/*.parquet';
drop table hudi_ctas_cow1;
create table hudi_ctas_cow1 using hudi location 'file:/tmp/hudi/hudi_tbl/' options (
type = 'cow',
primaryKey = 'tpep_pickup_datetime',
preCombineField = 'tpep_dropoff_datetime'
)
partitioned by (date_col) as select * from parquet_tbl1;
```
2. Read data from one of the partitions with "VendorId = 1" records.
```
select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10' group by 1;
```
This returned:
1, 1914
2, 3988
3. Issued a delete for the records with VendorId = 1 in this specific partition.
```
delete from hudi_ctas_cow1 where date_col = '2019-08-10' and VendorID = 1;
```
Verified from ".hoodie" that a new commit succeeded and that it added one new parquet file to the 2019-08-10 partition.
```
ls -ltr /tmp/hudi/hudi_tbl/date_col=2019-08-10/
total 2192
-rw-r--r-- 1 nsb wheel 571011 Feb 6 17:19 f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_10-27-119_20230206171846307.parquet
-rw-r--r-- 1 nsb wheel 529348 Feb 6 17:24 f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_0-83-1538_20230206172355871.parquet
```
The second parquet file was written by the delete operation.
4. Triggered the clustering job.
Property file contents:
```
cat /tmp/cluster.props
hoodie.datasource.write.recordkey.field=tpep_pickup_datetime
hoodie.datasource.write.partitionpath.field=date_col
hoodie.datasource.write.precombine.field=tpep_dropoff_datetime
hoodie.upsert.shuffle.parallelism=8
hoodie.insert.shuffle.parallelism=8
hoodie.delete.shuffle.parallelism=8
hoodie.bulkinsert.shuffle.parallelism=8
hoodie.clustering.plan.strategy.sort.columns=date_col,tpep_pickup_datetime
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.parquet.small.file.limit=0
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=1
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=1
```
```
./bin/spark-submit --class org.apache.hudi.utilities.HoodieClusteringJob ~/Downloads/hudi-utilities-bundle_2.11-0.12.2.jar --props /tmp/cluster.props --mode scheduleAndExecute --base-path /tmp/hudi/hudi_tbl/ --table-name hudi_ctas_cow1 --spark-memory 4g
```
Verified from ".hoodie" that the replace commit is present and has succeeded.
5. Relaunched spark-sql and queried the table.
```
refresh table hudi_ctas_cow1;
select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10' group by 1;
```
Output:
```
2 3988
Time taken: 3.818 seconds, Fetched 1 row(s)
```
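The two parquet file names listed in step 3 can be taken apart to see why the second file is attributed to the delete. The sketch below assumes the base file name layout `<fileId>_<writeToken>_<instantTime>.parquet` with a 17-digit `yyyyMMddHHmmssSSS` instant time; that layout is an assumption based on how these names look, and the helper names are made up for this example.

```python
from datetime import datetime
from typing import NamedTuple


class BaseFileName(NamedTuple):
    file_id: str
    write_token: str
    instant_time: str


def parse_base_file_name(name):
    """Split '<fileId>_<writeToken>_<instantTime>.parquet' (assumed layout)."""
    stem, _, _extension = name.rpartition(".")
    file_id, write_token, instant_time = stem.split("_")
    return BaseFileName(file_id, write_token, instant_time)


def instant_to_datetime(instant_time):
    """Convert a 17-digit yyyyMMddHHmmssSSS instant into a datetime."""
    seconds_part = datetime.strptime(instant_time[:14], "%Y%m%d%H%M%S")
    return seconds_part.replace(microsecond=int(instant_time[14:17]) * 1000)


# File names copied from the directory listing in step 3.
first = parse_base_file_name(
    "f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_10-27-119_20230206171846307.parquet"
)
second = parse_base_file_name(
    "f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_0-83-1538_20230206172355871.parquet"
)

print(first.file_id == second.file_id)           # same file group
print(instant_to_datetime(first.instant_time))   # earlier commit
print(instant_to_datetime(second.instant_time))  # later commit
```

Under that assumption, both files carry the same file ID and differ only in write token and instant time, which fits a newer version of the same file group being written by the later delete commit on this COW table.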
[GitHub] [hudi] ad1happy2go commented on issue #7839: [BUG] the deleted data reappeared after clustering on the table
Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1486674244
@MihawkZoro I also tried to reproduce this with the exact table config and clustering config you provided, but was unable to reproduce it. Are you still facing this issue?