Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/25 21:13:40 UTC

[GitHub] [hudi] namuny opened a new issue, #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

namuny opened a new issue, #6212:
URL: https://github.com/apache/hudi/issues/6212

   **Summary**
   During clustering, Hudi creates duplicate parquet file with the same file group ID and identical content. One of the two files are later marked as a duplicate and deleted. I'm using inline clustering with a single writer, so there's no concurrency issues at play.
   
   
   **Details**
   ![Screen Shot 2022-07-26 at 8 59 27 am](https://user-images.githubusercontent.com/12869068/180874304-48221e99-d2a5-4f4f-b40b-7c5536db9f08.png)
   ![Screen Shot 2022-07-26 at 8 59 17 am](https://user-images.githubusercontent.com/12869068/180874316-f9e249b9-1c2e-41d7-96d1-4fb87bf3e9c9.png)
   
   The two Spark jobs above are triggered during inline clustering.
   
   ![Screen Shot 2022-07-26 at 9 00 59 am](https://user-images.githubusercontent.com/12869068/180874446-98c0013f-95bc-4ccc-af3c-b84ed503aeec.png)
   ![Screen Shot 2022-07-26 at 9 00 55 am](https://user-images.githubusercontent.com/12869068/180874448-818716c8-9d07-42ba-911b-ae7190d423bc.png)
   
   As can be seen above, both of these Spark jobs invoke the `MultipleSparkJobExecutionStrategy.performClustering` method. As a result, two identical clustered files are created and stored.
   
   When `HoodieTable.reconcileAgainstMarkers` runs, the newer file is identified as a duplicate, and is deleted.
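
   One simplified way to picture this reconciliation step (a sketch under stated assumptions, not Hudi's actual implementation): every data file written during the operation leaves a marker, and any marked file that the final commit metadata does not reference is treated as a leftover and deleted. The object name, function name, and file names below are hypothetical and only illustrate the idea.

   ```scala
   // Simplified model of marker-based reconciliation; NOT Hudi's real code.
   object MarkerReconciliationSketch {

     /** Files that have a marker but are not referenced by the commit metadata. */
     def filesToDelete(markedFiles: Set[String], committedFiles: Set[String]): Set[String] =
       markedFiles.diff(committedFiles)

     def main(args: Array[String]): Unit = {
       // Hypothetical names: same file group ID ("fg1"), two different write tokens.
       val marked    = Set("fg1_0-10-20_001.parquet", "fg1_0-12-24_001.parquet")
       val committed = Set("fg1_0-10-20_001.parquet")
       println(filesToDelete(marked, committed))
     }
   }
   ```

   In this model, the file that is missing from the commit metadata is the one that gets removed, which matches the behavior described above.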
   
   **Expected behavior**
   
   Hudi should store the clustered file only once. Writing a duplicate of the file unnecessarily increases the duration of the job.
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 3.1.2
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   
   I'm using Copy on Write and inline clustering. My write config is:
   
   ```
   .write
         .format(HUDI_WRITE_FORMAT)
         .option(TBL_NAME.key(), tableName)
         .option(TABLE_TYPE.key(), COW_TABLE_TYPE_OPT_VAL)
         .option(PARTITIONPATH_FIELD.key(), ...)
         .option(PRECOMBINE_FIELD.key(), ...)
         .option(COMBINE_BEFORE_INSERT.key(), "true")
         .option(KEYGENERATOR_CLASS_NAME.key(), CUSTOM_KEY_GENERATOR)
         .option(URL_ENCODE_PARTITIONING.key(), "true")
         .option(HIVE_SYNC_ENABLED.key(), "true")
         .option(HIVE_DATABASE.key(), ...)
         .option(HIVE_PARTITION_FIELDS.key(), ...)
         .option(HIVE_TABLE.key(), tableName)
         .option(HIVE_TABLE_PROPERTIES.key(), tableName)
         .option(HIVE_PARTITION_EXTRACTOR_CLASS.key(), MULTI_PART_KEYS_VALUE_EXTRACTOR)
         .option(HIVE_USE_JDBC.key(), "false")
         .option(HIVE_SUPPORT_TIMESTAMP_TYPE.key(), "true")
         .option(HIVE_STYLE_PARTITIONING.key(), "true")
         .option(KeyGeneratorOptions.Config.TIMESTAMP_TYPE_FIELD_PROP, INPUT_TIMESTAMP_TYPE)
         .option(KeyGeneratorOptions.Config.INPUT_TIME_UNIT, INPUT_TIMESTAMP_UNIT)
         .option(KeyGeneratorOptions.Config.TIMESTAMP_OUTPUT_DATE_FORMAT_PROP, OUTPUT_TIMESTAMP_FORMAT)
         .option(OPERATION.key(), UPSERT_OPERATION_OPT_VAL)
         .option(INLINE_CLUSTERING.key(), "true")
         .option(INLINE_CLUSTERING_MAX_COMMITS.key(), "2")
         .option(PLAN_STRATEGY_SMALL_FILE_LIMIT.key(), "73400320") // 70MB
         .option(PLAN_STRATEGY_TARGET_FILE_MAX_BYTES.key(), "209715200") // 200MB
         .option(COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.key(), ...)
         .option(PARQUET_MAX_FILE_SIZE.key(), "104857600") // 100MB
         .option(PARQUET_SMALL_FILE_LIMIT.key(), "104857600") // 100MB
         .option(PARALLELISM_VALUE.key(), getParallelism().toString)
         .option(FILE_LISTING_PARALLELISM_VALUE.key(), getParallelism().toString)
         .option(FINALIZE_WRITE_PARALLELISM_VALUE.key(), getParallelism().toString)
         .option(DELETE_PARALLELISM_VALUE.key(), getParallelism().toString)
         .option(INSERT_PARALLELISM_VALUE.key(), getParallelism().toString)
         .option(ROLLBACK_PARALLELISM_VALUE.key(), getParallelism().toString)
         .option(UPSERT_PARALLELISM_VALUE.key(), getParallelism().toString)
         .option(INDEX_TYPE.key(), indexType)
         .option(SIMPLE_INDEX_PARALLELISM.key(), getParallelism().toString)
         .option(AUTO_CLEAN.key(), "true")
         .option(CLEANER_PARALLELISM_VALUE.key(), getParallelism().toString)
         .option(CLEANER_COMMITS_RETAINED.key(), "10")
         .option(PRESERVE_COMMIT_METADATA.key(), "true")
         .option(HoodieMetadataConfig.ENABLE.key(), "true")
         .option(META_SYNC_CONDITIONAL_SYNC.key(), "false")
         .option(ROLLBACK_PENDING_CLUSTERING_ON_CONFLICT.key(), "true")
         .option(UPDATES_STRATEGY.key(), "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy")
         .option(MARKERS_TYPE.key(), MarkerType.DIRECT.toString)
         .mode(SaveMode.Append)
   ```
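
   The byte values in the config above are whole multiples of 1 MiB (1024 * 1024 bytes), as the inline comments suggest. A small helper along these lines (hypothetical names `SizeSketch` and `mib`, not part of Hudi) would make that arithmetic explicit:

   ```scala
   // Hypothetical helper for illustration; not a Hudi API.
   object SizeSketch {
     /** Converts a size in mebibytes to bytes. */
     def mib(n: Long): Long = n * 1024L * 1024L
   }
   ```

   For example, `SizeSketch.mib(70).toString` yields the string passed to `PLAN_STRATEGY_SMALL_FILE_LIMIT` above.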
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] HEPBO3AH commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

HEPBO3AH commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1201765923

   @xushiyan any updates on this? We seem to be impacted as well, observing the same behavior.


[GitHub] [hudi] HEPBO3AH commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

HEPBO3AH commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1217241460

   > Increase `INLINE_CLUSTERING_MAX_COMMITS.key()` to, say, 3, and let's see if we encounter this issue.
   
   It is still happening.
   
   > Can we add a delay of, say, 30 seconds between successive commits in your for loop?
   
   Adding a 30-second sleep between runs didn't solve the issue.
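
   For reference, the delay experiment can be sketched roughly as follows. This is only an illustration: `DelayedCommitsSketch` and `runWithDelay` are hypothetical names, and `write` stands in for whatever performs one Hudi write per batch.

   ```scala
   // Hypothetical helper for illustration; not part of Hudi or Spark.
   object DelayedCommitsSketch {

     /** Applies `write` to each batch in order, sleeping `delayMillis` between batches. */
     def runWithDelay[A](batches: Seq[A], delayMillis: Long)(write: A => Unit): Unit =
       batches.zipWithIndex.foreach { case (batch, index) =>
         // No delay is needed before the very first commit.
         if (index > 0) Thread.sleep(delayMillis)
         write(batch)
       }
   }
   ```

   With `delayMillis = 30000L` this spaces successive commits 30 seconds apart, which is the setup that still showed the duplicate files.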


[GitHub] [hudi] nsivabalan commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1233734308

   Here is the fix: https://github.com/apache/hudi/pull/6561
   Can you verify that, with the patch, you no longer see such duplicates?


[GitHub] [hudi] nsivabalan commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1225242795

   Give me two days. I am going to take a stab at this and will update here.


[GitHub] [hudi] yihua commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

yihua commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1201815501

   @HEPBO3AH It looks like the clustering DAG is triggered twice.  This needs reproducing.  Could you also provide the timeline under `<base_path>/.hoodie`?
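
   If the table lives on a local filesystem, the timeline files can be collected with a few lines of Scala. This is a minimal sketch under that assumption (for S3 or HDFS, the Hadoop `FileSystem` API would be needed instead); `TimelineListingSketch` and `listTimeline` are hypothetical names.

   ```scala
   import java.io.File

   // Hypothetical helper for illustration; assumes a table on the local filesystem.
   object TimelineListingSketch {

     /** Returns the names of the files directly under <basePath>/.hoodie, sorted by name. */
     def listTimeline(basePath: String): Seq[String] = {
       // listFiles() returns null when the directory does not exist.
       val entries = Option(new File(basePath, ".hoodie").listFiles()).getOrElse(Array.empty[File])
       entries.filter(_.isFile).map(_.getName).sorted.toSeq
     }
   }
   ```

   Because timeline file names start with the instant time, sorting by name keeps entries for the same instant together and roughly in chronological order.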


[GitHub] [hudi] namuny commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

namuny commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1234864229

   Thank you! Will verify the fix today/tomorrow and report back.


[GitHub] [hudi] namuny commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

namuny commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1225170438

   Gentle bump to see if anyone has any further recommendations on what information we could provide to help with reproducing the issue.


[GitHub] [hudi] namuny commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

namuny commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1232202801

   Hi @nsivabalan, sorry to be a hassle but do you have any updates on this one?


[GitHub] [hudi] nsivabalan commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1210124453

   @HEPBO3AH:
   Can we try a few things?
   - Increase `INLINE_CLUSTERING_MAX_COMMITS.key()` to, say, 3, and let's see if we encounter this issue.
   - Can we add a delay of, say, 30 seconds between successive commits in your for loop?
   
   Let us know if the issue is still reproducible in the above cases as well.
   


[GitHub] [hudi] nsivabalan commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1233730770

   Yes, I was able to reproduce it :(
   https://issues.apache.org/jira/browse/HUDI-4760
   Will put up a fix shortly.
   


[GitHub] [hudi] nsivabalan commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1237643717

   thanks! 


[GitHub] [hudi] HEPBO3AH commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

HEPBO3AH commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1203249824

   Hi,
   
   Here is the code sample to replicate it:
   
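   ```
       // Imports assumed for Hudi 0.11.0 and Spark 3.1.2 (not part of the original snippet).
       import org.apache.hudi.DataSourceWriteOptions._
       import org.apache.hudi.config.HoodieClusteringConfig.{INLINE_CLUSTERING, INLINE_CLUSTERING_MAX_COMMITS}
       import org.apache.hudi.config.HoodieWriteConfig.TBL_NAME
       import org.apache.spark.sql.{SaveMode, SparkSession}
   
       // logPath, tableName and path are defined elsewhere (not shown here).
       val spark = SparkSession
         .builder()
         .master("local[3]")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", logPath)
         .getOrCreate()
   
       import spark.implicits._
   
       // 5 batches of 1000 ids each, so the loop performs 5 commits.
       val ids = (1 to 5000).grouped(1000).toSeq
   
       for (idSection <- ids) {
         val df = idSection.toDF("id")
   
         df
           .write
           .format("org.apache.hudi")
           .option(TBL_NAME.key(), tableName)
           .option(TABLE_TYPE.key(), COW_TABLE_TYPE_OPT_VAL)
           .option(RECORDKEY_FIELD.key(), "id")
           .option(PARTITIONPATH_FIELD.key(), "")
           .option(OPERATION.key(), INSERT_OPERATION_OPT_VAL)
           .option(INLINE_CLUSTERING.key(), "true")
           // Run inline clustering after every commit.
           .option(INLINE_CLUSTERING_MAX_COMMITS.key(), "1")
           .mode(SaveMode.Append)
           .save(path)
       }
   ```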
   
   What we see in `.hoodie`:
   
   <img width="994" alt="image" src="https://user-images.githubusercontent.com/15118722/182479505-e5567179-6f80-4e7d-88f9-e463458b9a06.png">
   
   These are the files that were created twice and then deleted. The delete markers correspond to the clustering commit IDs:
   
   <img width="809" alt="image" src="https://user-images.githubusercontent.com/15118722/182479683-f1c95881-b0ba-4a62-845e-c4d0390a3483.png">
   
   These have a one-to-one relationship with clustering runs. If you set clustering to happen every 2 commits, only 2 delete markers will be present, one for each clustering run, because the code sample does 5 passes in total.
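
   The arithmetic behind that observation can be written down as a small sketch. It assumes, as a simplification, that one clustering run is triggered after every `maxCommits`-th write in the loop; `ClusteringCountSketch` and `expectedClusterings` are hypothetical names, not Hudi APIs.

   ```scala
   // Simplified model for illustration; not Hudi's scheduling logic.
   object ClusteringCountSketch {

     /** Number of clustering runs expected after `totalCommits` writes. */
     def expectedClusterings(totalCommits: Int, maxCommits: Int): Int = {
       require(maxCommits > 0, "maxCommits must be positive")
       totalCommits / maxCommits // integer division: a trailing partial group triggers nothing
     }
   }
   ```

   With the 5 passes of the code sample, `expectedClusterings(5, 1)` gives 5 and `expectedClusterings(5, 2)` gives 2, in line with the number of delete markers reported above.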
   


[GitHub] [hudi] nsivabalan commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1216658622

   @HEPBO3AH: any updates, please?


[GitHub] [hudi] namuny commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

namuny commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1234922971

   Did a quick test with your fix. I don't see any duplicates anymore during clustering! 👯 


[GitHub] [hudi] nsivabalan closed issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

Posted by GitBox <gi...@apache.org>.

nsivabalan closed issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering
URL: https://github.com/apache/hudi/issues/6212

