Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/05/31 23:24:54 UTC

[GitHub] [hudi] n3nash commented on issue #2995: [SUPPORT] Upserts creating duplicates after enabling metadata table in Hudi 0.7 indexing pipeline

n3nash commented on issue #2995:
URL: https://github.com/apache/hudi/issues/2995#issuecomment-851715268


   @jtmzheng Thanks for the very detailed information; it helps in understanding the problem. Let me answer some of your questions inline before attempting to debug the underlying root cause of the duplicates.
   
   > **Expected behavior**
   > 
   > Same behavior as Hudi 0.6 but now using the metadata table to track files/partitions. Happy to provide whatever info I can.
   > 
   > Questions:
   > 
   > 1. What is causing these duplicates to occur? Since no errors happened as far as I can tell, what info can I look at to debug/RCA? I’ve verified there are no duplicates (ie. checked some partitions) on 0.6 dataset.
Can you check whether the files present in the dataset are the same as the ones tracked in the metadata table? This would require you to (a rough spark-shell sketch follows the list below):
   
   1. Perform a listing on the entire dataset and get the unique files
   2. Read the metadata table and get the list of files from there
   3. Diff the 2 results to see if there are files in your dataset which are not present in the metadata table. 
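
Here is a rough sketch you can run from `spark-shell` for steps 1 and 3 (not an official tool; the table base path is a placeholder, and `/tmp/metadata_files.txt` is assumed to be a per-file export of the metadata table's listing, e.g. saved from the CLI's `metadata list-files` output, one file name per line):

```scala
import scala.collection.mutable
import scala.io.Source
import org.apache.hadoop.fs.Path

val basePath = new Path("s3a://my-bucket/path/to/hudi/table")   // placeholder
val fs = basePath.getFileSystem(spark.sparkContext.hadoopConfiguration)

// 1. Recursively list data files under the base path, skipping the .hoodie folder
val physicalFiles = mutable.Set[String]()
val iter = fs.listFiles(basePath, true)
while (iter.hasNext) {
  val status = iter.next()
  if (!status.getPath.toString.contains("/.hoodie/")) {
    physicalFiles += status.getPath.getName
  }
}

// 2. File names exported from the metadata table, one per line
val metadataFiles = Source.fromFile("/tmp/metadata_files.txt")
  .getLines().map(_.trim).filter(_.nonEmpty).toSet

// 3. Diff the two listings
val missingFromMetadata = physicalFiles.toSet -- metadataFiles
val extraInMetadata = metadataFiles -- physicalFiles.toSet
println(s"On storage but not in metadata table: ${missingFromMetadata.size}")
println(s"In metadata table but not on storage: ${extraInMetadata.size}")
missingFromMetadata.take(20).foreach(println)
```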
   
Additionally, if you have the logs from the first application run where you turned the `hoodie.metadata.enable` flag on, can you grep for the following log lines:
   
   `Creating a new metadata table in`
   `Initializing metadata table by using file listings in`
   `files to metadata`
   
Please share the output of grepping for those lines.
   
   > 2. How can the metadata table be inspected? I can’t tell from https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements
   
You can inspect this with the `MetadataCommand` in the Hudi CLI. You can read about how to set up the CLI here -> https://hudi.apache.org/docs/deployment.html#cli
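
For example, something along these lines inside the CLI (command names as of the 0.7 `MetadataCommand`; run `help` inside the CLI to confirm what your build supports, and the base path is a placeholder):

```
connect --path s3a://my-bucket/path/to/hudi/table
metadata list-partitions
metadata list-files --partition <partition-path>
metadata stats
```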
   
   > 3. Should `hoodie.metadata.validate` be enabled? My understanding is this is a “dry run” config where S3 file listing will still happen as before while also updating the metadata table
   
That is correct. In this scenario, enabling it before the duplicates happened would have helped, but enabling it after the corruption has happened does not. At a high level, you don't need to enable it unless this issue is fully reproducible. 
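
For reference, both flags are just write options on the Hudi datasource; a minimal sketch (the table name, record key / precombine / partition fields and path are placeholders for your own config, and `df` is the batch being upserted):

```scala
// Other required write options omitted for brevity.
df.write.format("hudi").
  option("hoodie.table.name", "my_table").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.metadata.enable", "true").
  option("hoodie.metadata.validate", "true").  // dry run: file listing still comes from storage while the metadata table is also maintained
  mode("append").
  save("s3a://my-bucket/path/to/hudi/table")
```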
   
   > 4. How do we recover when duplicates occur? I see “records deduplicate” is suggested in https://hudi.apache.org/docs/deployment.html#duplicates (NB: seems like this should be “repair deduplicate”?), do we need to turn off ingestion first and then run over every affected partition?
   
Yes. You will need to turn off ingestion before deduping. Unfortunately, this command is pretty old and has not been actively maintained since we rarely see duplicate issues. You might need to re-bootstrap your dataset to recover, or start the shadow pipeline fresh. 
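
As a starting point, here is a sketch for spotting affected keys in a suspect partition before repairing or re-bootstrapping (the path and partition are placeholders; `_hoodie_record_key` and `_hoodie_commit_time` are Hudi's standard meta columns):

```scala
import org.apache.spark.sql.functions._

// Load one suspect partition and look for record keys that appear more than once
val partitionDf = spark.read.format("hudi")
  .load("s3a://my-bucket/path/to/hudi/table/2021/05/31/*")

val dupes = partitionDf
  .groupBy("_hoodie_record_key")
  .agg(count(lit(1)).as("copies"), collect_set("_hoodie_commit_time").as("commits"))
  .filter(col("copies") > 1)

println(s"Duplicate keys in partition: ${dupes.count()}")
dupes.show(20, false)
```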
   
   > 5. How do we recover if the metadata table is corrupted? Should we delete the existing metadata table from the CLI and recreate? Is this safe to do?
   
Yes. You can delete the existing metadata table using the CLI or manually, but you need to stop ingestion while doing so. It is safe to do this. If there was a corruption, you should then disable `hoodie.metadata.enable`, since the code may have a bug. After this, you can resume your pipeline. 
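
For example (assuming the 0.7 layout where the metadata table lives under the dataset's `.hoodie/metadata` folder; confirm command names with `help` in the CLI):

```
# From the Hudi CLI, with ingestion stopped:
connect --path s3a://my-bucket/path/to/hudi/table
metadata delete

# Or remove the metadata table folder manually:
hadoop fs -rm -r s3a://my-bucket/path/to/hudi/table/.hoodie/metadata
```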
   
   > 6. What upgrade path is suggested from 0.6 to 0.7 with metadata table enabled? Should the metadata table be created from the CLI pre-ingestion and then starting up the consumer after?
   
   As long as you don't have concurrently running jobs, simply turning on the metadata table before the next ingestion run suffices. No need to stop the job. 
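
After that first run with the flag on, one quick sanity check (assuming the 0.7 layout, with the base path a placeholder) is to confirm the metadata table got bootstrapped, and to grep the driver log for the lines listed above:

```scala
import org.apache.hadoop.fs.Path

val base = new Path("s3a://my-bucket/path/to/hudi/table")
val fs = base.getFileSystem(spark.sparkContext.hadoopConfiguration)
// The metadata table is itself a small Hudi table under .hoodie/metadata
println(s"Metadata table bootstrapped: ${fs.exists(new Path(base, ".hoodie/metadata"))}")
```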
   
   Please share your logs to help debug this problem further. 

