Posted to commits@hudi.apache.org by "MihawkZoro (via GitHub)" <gi...@apache.org> on 2023/02/07 02:36:12 UTC

[GitHub] [hudi] MihawkZoro commented on issue #7839: [BUG] the deleted data reappeared after clustering on the table

MihawkZoro commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1420111468

   > Hey @MihawkZoro: I could not reproduce this on my end. Here are the steps I followed.
   > 
   > 1. Created a table via spark-sql
   > 
   > ```
   > create table parquet_tbl1 using parquet location 'file:///tmp/tbl1/*.parquet';
   > drop table hudi_ctas_cow1;
   > create table hudi_ctas_cow1 using hudi location 'file:/tmp/hudi/hudi_tbl/' options (
   >   type = 'cow',
   >   primaryKey = 'tpep_pickup_datetime',
   >   preCombineField = 'tpep_dropoff_datetime'
   >  )
   > partitioned by (date_col) as select * from parquet_tbl1;
   > ```
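   > 
   > As a quick sanity check, the table type that the CTAS produced can be read directly from the table config under the base path used above; a COW table should report COPY_ON_WRITE:
   > 
   > ```
   > # read the table type from the Hudi table config under the base path used above
   > grep hoodie.table.type /tmp/hudi/hudi_tbl/.hoodie/hoodie.properties
   > # expected for this repro: hoodie.table.type=COPY_ON_WRITE
   > ```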
   > 
   > 2. Read data from one of the partitions to check the count of records with VendorId = 1.
   > 
   > ```
   > select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10' group by 1;
   > ```
   > 
   > This returned two rows: VendorId 1 with 1914 records and VendorId 2 with 3988 records.
   > 
   > 3. Issued deletes for records with VendorId = 1 in this specific partition.
   > 
   > ```
   > delete from hudi_ctas_cow1 where date_col = '2019-08-10' and VendorID = 1;
   > ```
   > 
   > Verified from ".hoodie" that a new commit succeeded and that it added one new parquet file to the 2019-08-10 partition.
   > 
   > ```
   > ls -ltr /tmp/hudi/hudi_tbl/date_col=2019-08-10/
   > total 2192
   > -rw-r--r--  1 nsb  wheel  571011 Feb  6 17:19 f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_10-27-119_20230206171846307.parquet
   > -rw-r--r--  1 nsb  wheel  529348 Feb  6 17:24 f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_0-83-1538_20230206172355871.parquet
   > ```
   > 
   > The second parquet file was written by the delete operation.
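   > 
   > To confirm the delete took effect before clustering, the same count query can be re-run at this point; it should now return only the VendorId = 2 row:
   > 
   > ```
   > -- re-check counts after the delete, before clustering
   > select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10' group by 1;
   > -- expected: a single row "2  3988"
   > ```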
   > 
   > 4. Triggered the clustering job.
   >    Property file contents:
   > 
   > ```
   > cat /tmp/cluster.props 
   > 
   > hoodie.datasource.write.recordkey.field=tpep_pickup_datetime
   > hoodie.datasource.write.partitionpath.field=date_col
   > hoodie.datasource.write.precombine.field=tpep_dropoff_datetime
   > 
   > hoodie.upsert.shuffle.parallelism=8
   > hoodie.insert.shuffle.parallelism=8
   > hoodie.delete.shuffle.parallelism=8
   > hoodie.bulkinsert.shuffle.parallelism=8
   > 
   > hoodie.clustering.plan.strategy.sort.columns=date_col,tpep_pickup_datetime
   > hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
   > 
   > hoodie.parquet.small.file.limit=0
   > hoodie.clustering.inline=true
   > hoodie.clustering.inline.max.commits=1
   > hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
   > hoodie.clustering.plan.strategy.small.file.limit=629145600
   > hoodie.clustering.async.enabled=true
   > hoodie.clustering.async.max.commits=1
   > ```
   > 
   > ```
   > ./bin/spark-submit --class org.apache.hudi.utilities.HoodieClusteringJob ~/Downloads/hudi-utilities-bundle_2.11-0.12.2.jar --props /tmp/cluster.props --mode scheduleAndExecute --base-path /tmp/hudi/hudi_tbl/ --table-name hudi_ctas_cow1 --spark-memory 4g
   > ```
   > 
   > Verified from ".hoodie" that the replace commit is present and succeeded.
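   > 
   > For reference, the replace commit left by clustering can be listed straight from the timeline under the same base path:
   > 
   > ```
   > # list the clustering instant on the timeline
   > ls /tmp/hudi/hudi_tbl/.hoodie/ | grep replacecommit
   > # a completed instant shows up as <timestamp>.replacecommit
   > ```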
   > 
   > 5. Re-launched spark-sql and queried the table.
   > 
   > ```
   > refresh table hudi_ctas_cow1;
   > select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10' group by 1;
   > ```
   > 
   > Output:
   > 
   > ```
   > 2	3988
   > Time taken: 3.818 seconds, Fetched 1 row(s)
   > ```
   
   The table in my case is MOR (merge-on-read); screenshot:
   https://user-images.githubusercontent.com/32875366/217133399-ba18e8be-4b75-4983-9a0d-58787906b222.png
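   For completeness, a MOR variant of the CTAS from the steps above only needs the type option changed (the table name and location below are placeholders, not from the original repro):
   
   ```
   -- hypothetical MOR equivalent of the earlier COW CTAS
   create table hudi_ctas_mor1 using hudi location 'file:/tmp/hudi/hudi_mor_tbl/' options (
     type = 'mor',
     primaryKey = 'tpep_pickup_datetime',
     preCombineField = 'tpep_dropoff_datetime'
   )
   partitioned by (date_col) as select * from parquet_tbl1;
   ```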

