Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/23 20:15:54 UTC

[GitHub] [hudi] luffyd opened a new issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

luffyd opened a new issue #1866:
URL: https://github.com/apache/hudi/issues/1866


   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)? yes
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I have a simple while(true) loop in which I have been committing data to a Hudi MOR table. I successfully added more than 1,000 commits, but I noticed that the table size kept growing continuously and no clean jobs were ever run.
   
   I used the hudi-cli command `cleans show` to confirm this.
   
   I did run `cleans run`, and it cleaned a lot of data: after the run, the table size dropped from 25 TB to 500 GB.
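   
   Roughly what I did in hudi-cli (transcript sketch from memory; the S3 path is elided):
   
   ```
   connect --path s3://<table-base-path>
   cleans show    # empty -> no clean has ever run
   cleans run     # reclaimed ~24.5 TB
   ```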
   
   **To Reproduce**
   
   
   Steps to reproduce the behavior:
   
   Code snippet to run:
   
   ```
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.spark.sql.SaveMode
   
   val startTime = System.currentTimeMillis()
   
   val parallelism = options.getOrElse("parallelism", Math.max(2, upsertCount / 100000).toString).toInt
   println("parallelism", parallelism)
   inputDF
     .write
     .format("org.apache.hudi")
     .option(HoodieWriteConfig.TABLE_NAME, options.getOrElse("tableName", "facestest"))
     .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partitionKey")
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "nodeId")
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
     .option("hoodie.upsert.shuffle.parallelism", parallelism)     // not the ideal way, but works
     .option("hoodie.bulkinsert.shuffle.parallelism", parallelism) // not the ideal way, but works
     .mode(SaveMode.Append)
     .save(getHudiPath(spark))
   
   val endTime = System.currentTimeMillis()
   val diff = endTime - startTime
   timings = diff :: timings
   CloudWatchWriter.addTimeMetric("faceUpsertTimeForLoop_" + run, diff, spark.sparkContext.isLocal)
   ```
   
   **Expected behavior**
   
   Clean jobs should have run.
   
   **Environment Description**
   
   * Hudi version : 0.5.3
   
   * Spark version : 2.4
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   I added some logging and used hudi-cli to rectify the situation: running `cleans run` cleaned a lot of data, bringing the table from 25 TB down to 500 GB, as noted above.
   
   I am guessing this line resolves to false, so clean-up is never triggered:
   https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L354
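   
   For reference, my rough reading of that retention check, paraphrased as standalone Scala (my interpretation, not the actual Hudi source):
   
   ```
   // Simplified paraphrase of the clean planner's retention check under the
   // default KEEP_LATEST_COMMITS policy. `completedCommits` holds completed
   // COMMIT instants only -- deltacommits that have not been compacted yet
   // do not count, which would explain why clean never finds work to do.
   def earliestCommitToRetain(completedCommits: Seq[String], // sorted instant times
                              commitsRetained: Int           // hoodie.cleaner.commits.retained (default 10)
                             ): Option[String] =
     if (completedCommits.size > commitsRetained)
       Some(completedCommits(completedCommits.size - commitsRetained))
     else
       None // e.g. 4 commits vs. 10 retained => "No earliest commit to retain"
   ```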
   
   **Stacktrace**
   
   ***Hudi trace logs***
   ```
   20/07/21 23:09:10 WARN IncrementalTimelineSyncFileSystemView: Incremental Sync of timeline is turned off or deemed unsafe. Will revert to full syncing
   20/07/21 23:09:10 INFO FileSystemViewHandler: TimeTakenMillis[Total=2647, Refresh=2646, handle=1, Check=0], Success=true, Query=basepath=s3%3A%2F%2Fchelan-dev-mock-faces%2FTestFacesUpserForLoop%2Feight&lastinstantts=20200721225935&timelinehash=44494fd19bd1c8ab3d67910abfd906fe922a8c118ec2279044f586278833480b, Host=ip-10-0-1-147.us-west-2.compute.internal:41753, synced=true
   20/07/21 23:09:10 INFO CleanPlanner: No earliest commit to retain. No need to scan partitions !!
   20/07/21 23:09:10 INFO CleanActionExecutor: Nothing to clean here. It is already clean
   ```
   
   ***Some custom logging in the application***
   ```
   val metaClient = new HoodieTableMetaClient(spark.sparkContext.hadoopConfiguration, getHudiPath(spark), true)
   println("metaClient.getActiveTimeline().countInstants()", metaClient.getActiveTimeline().countInstants())
   println("metaClient.getCommitTimeline.filterCompletedInstants.countInstants()", metaClient.getCommitTimeline.filterCompletedInstants.countInstants())
   println("metaClient.getCommitTimeline.filterCompletedAndCompactionInstants.countInstants()", metaClient.getCommitTimeline.filterCompletedAndCompactionInstants().countInstants())
   ```
   
   Results for the above. Note that only 4 of the 33 active instants are completed commits; the rest are mostly uncompacted deltacommits:
   ```
   (metaClient.getActiveTimeline().countInstants(),33)
   (metaClient.getCommitTimeline.filterCompletedInstants.countInstants(),4)
   (metaClient.getCommitTimeline.filterCompletedAndCompactionInstants.countInstants(),4)
   ```
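   
   For completeness, counting deltacommits directly should confirm this (sketch, reusing the same metaClient; I believe the active timeline exposes a deltacommit-only view):
   
   ```
   // Completed deltacommits on the active timeline; if this is ~29, it accounts
   // for the gap between 33 active instants and 4 completed commits.
   println("completed deltacommits",
     metaClient.getActiveTimeline.getDeltaCommitTimeline.filterCompletedInstants.countInstants())
   ```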
    
   
   





[GitHub] [hudi] luffyd commented on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

luffyd commented on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663648589


   Thanks @satishkotha,
   I have inline compaction turned on now, and I see cleans did happen! Is there a possibility that commits get archived before the clean job runs, resulting in a no-op? I will continue to monitor.
   
   Also, can you confirm whether I can run a clean job in a separate Spark job concurrently while the streaming write is happening? I'd guess it should be fine, since compaction runs have that ability.
   





[GitHub] [hudi] luffyd edited a comment on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

luffyd edited a comment on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663741729


   OK, thanks. No, I was not planning to run it as a separate process continuously; I just wanted to execute clean commands from the CLI so that my streaming tests progress faster.





[GitHub] [hudi] satishkotha commented on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

satishkotha commented on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663683323


   > Is there a possibility that commits get archived before clean job is resulting in a noop. I will continue to monitor.
   
   Clean and archival are somewhat independent today, so this no-op should not happen.
   
   > Also can you confirm If I can run a clean job in a separate spark job concurrently while streaming write is happening, guess it should be fine as compaction runs have that ability
   
   Why are you considering a separate Spark job for clean? Are you seeing clean take a lot of time? You can run clean concurrently with the write by setting `hoodie.clean.async` to true; this runs clean in the same job, but concurrently with the write.
   
   I don't know of anyone using a separate Spark job to run clean. Theoretically, I think it is possible, but you may have to do some testing, because it isn't used like this AFAIK.
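   
   For example, something like this (untested sketch, reusing the writer options from your snippet above) keeps clean automatic but off the write's critical path:
   
   ```
   // Untested sketch: async clean inside the same writer job.
   inputDF.write
     .format("org.apache.hudi")
     // ... your existing table/key/operation options ...
     .option("hoodie.clean.automatic", "true") // default: schedule clean after each commit
     .option("hoodie.clean.async", "true")     // run clean concurrently with ingestion
     .mode(SaveMode.Append)
     .save(getHudiPath(spark))
   ```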





[GitHub] [hudi] bvaradar closed issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

bvaradar closed issue #1866:
URL: https://github.com/apache/hudi/issues/1866


   





[GitHub] [hudi] luffyd commented on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

luffyd commented on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-668667895


   Please resolve this; cleans are happening fine now.
   
   I also added the following. I think it comes at the expense of the timeline-based (incremental/point-in-time) query features, since only the latest file version is retained; we will relax it later.
   
   ```
   import org.apache.hudi.client.HoodieWriteClient
   import org.apache.hudi.common.model.HoodieCleaningPolicy
   import org.apache.hudi.config.{HoodieCompactionConfig, HoodieWriteConfig}
   import org.apache.spark.api.java.JavaSparkContext
   
   // Keep only the latest file version per file group -- aggressive, but fine for tests.
   val compactionConfig = HoodieCompactionConfig.newBuilder()
     .withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS)
     .retainFileVersions(1)
     .build()
   val writerConfig = HoodieWriteConfig.newBuilder()
     .withCompactionConfig(compactionConfig)
     .withPath(getHudiPath(spark))
     .build()
   // HoodieWriteClient expects a JavaSparkContext
   val writeClient = new HoodieWriteClient(new JavaSparkContext(spark.sparkContext), writerConfig)
   
   // Run the cleaner explicitly
   val cleanStats = writeClient.clean()
   ```
   
   
   





[GitHub] [hudi] satishkotha commented on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

satishkotha commented on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663751174


   Sounds good. Please try it and let me know if you see any issues.





[GitHub] [hudi] satishkotha edited a comment on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

satishkotha edited a comment on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663298411


   Hi @luffyd,
   
   By default, upsert on MOR tables creates deltacommits. [Compaction](https://cwiki.apache.org/confluence/display/HUDI/Design+And+Architecture#DesignAndArchitecture-Compaction) needs to run to convert deltacommits into commits, and clean only works after compaction runs and commits are created. Clean also does not remove file groups that have a pending compaction. Can you set up inline compaction [using the instructions here](https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoIruncompactionforaMORdataset) for testing and see if that helps?
   
   If that doesn't work, can you share a screenshot of the files in the .hoodie folder under 'getHudiPath'?
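   
   For example, something like this (untested sketch; pick a max-delta-commits value that suits your ingestion rate):
   
   ```
   // Untested sketch: inline compaction on the MOR upsert path, so that
   // deltacommits get compacted into commits that clean can then act on.
   inputDF.write
     .format("org.apache.hudi")
     // ... your existing table/key/operation options ...
     .option("hoodie.compact.inline", "true")                // compact synchronously after commits
     .option("hoodie.compact.inline.max.delta.commits", "5") // trigger compaction every 5 deltacommits
     .mode(SaveMode.Append)
     .save(getHudiPath(spark))
   ```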




