Posted to dev@hudi.apache.org by selvaraj periyasamy <se...@gmail.com> on 2020/10/21 06:31:25 UTC

Deleting Hudi Partitions

Team,

I have a COW table that is sub-partitioned by Date/Hour. For some use
cases, I need to completely remove a few partitions (only a few hours).
Hudi maintains metadata, so manually removing the folders, along with
the matching entries in the Hive metastore, may corrupt the Hudi
metadata. What is the best way to do this?


Thanks,
Selva

Re: Deleting Hudi Partitions

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
 
Fixing Satish's incorrect email address.

On Wednesday, October 21, 2020, 06:19:43 PM PDT, Balaji Varadarajan <v....@ymail.com.invalid> wrote:

cc Satish, who implemented Insert Overwrite support.

We recently landed Insert Overwrite support in Hudi. Partition-level deletion is a logical extension of this feature, but it is not available yet. I have added a JIRA to track this: https://issues.apache.org/jira/browse/HUDI-1350

Meanwhile, using the master branch, you can do this in two steps. First, generate one record for each partition you want to delete and commit the batch with insert overwrite. This essentially truncates each partition to one record. Then issue a hard delete on those records. By keeping cleaner retention at 1, you can then clean up the files in the directory. Satish, can you chime in on whether this makes sense and whether you see any issues with it?

Thanks,
Balaji.V
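
The two-step workaround described above could be planned roughly as follows. This is only a sketch, not a tested procedure: the table name, the column names (record_key, partition_path, ts), the placeholder key prefix, and the helper functions are all hypothetical, and the partition path format is assumed. The option keys are standard Hudi datasource and cleaner configs, but check them against the Hudi version in use.

```python
# Sketch of the two-step partition removal. All helper names, column names,
# and the table name are hypothetical; only the "hoodie.*" keys come from Hudi.

def placeholder_records(partitions, key_prefix="__placeholder__"):
    """Build one dummy record per partition that should be emptied."""
    return [
        {"record_key": f"{key_prefix}{p}", "partition_path": p, "ts": 0}
        for p in partitions
    ]


def step_options(table_name, operation):
    """Writer options shared by both steps; only the operation differs."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": "record_key",
        "hoodie.datasource.write.partitionpath.field": "partition_path",
        "hoodie.datasource.write.precombine.field": "ts",
        # Keep cleaner retention at 1 so replaced files are cleaned up sooner.
        "hoodie.cleaner.commits.retained": "1",
    }


# Assumed partition path layout (date/hour); adjust to the real table.
partitions = ["2020/10/20/05", "2020/10/20/06"]
records = placeholder_records(partitions)

# Step 1 truncates each listed partition to a single placeholder record.
# Step 2 hard-deletes those placeholder records.
steps = [
    step_options("my_cow_table", "insert_overwrite"),
    step_options("my_cow_table", "delete"),
]

# With a Spark session and the Hudi bundle available, each step would be
# written along these lines (base_path is an assumption):
#   df = spark.createDataFrame(records)
#   df.write.format("hudi").options(**steps[i]).mode("append").save(base_path)
```

The same placeholder records are written in both steps, so the delete in step 2 targets exactly the keys inserted in step 1.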

Re: Deleting Hudi Partitions

Posted by Satish Kotha <sa...@uber.com.INVALID>.
Yes, that would work. You would typically set the option below on the
DataFrame writer to use insert overwrite (InsertOverwrite is a new API;
I haven't updated the documentation yet).

   - hoodie.datasource.write.operation: insert_overwrite
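
As a rough illustration of where this option goes, the sketch below builds the writer options as a plain dictionary. The table name, the column names, and the helper function are hypothetical; only the option keys are Hudi datasource configs, and the write call in the trailing comment assumes a Spark session with the Hudi bundle on the classpath.

```python
# Minimal sketch: the helper and its argument values are hypothetical.

def insert_overwrite_options(table_name, record_key, partition_field, precombine_field):
    """Build Hudi writer options that select the insert_overwrite operation."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "insert_overwrite",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


opts = insert_overwrite_options("my_cow_table", "record_key", "partition_path", "ts")

# The options would then be passed to the DataFrame writer, for example:
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```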


Let me know if you have any questions.

@Balaji Thanks for creating the follow-up ticket. I agree this can be
supported in a much simpler way using the insert_overwrite primitive.

