Posted to user@oozie.apache.org by Martin Chalupa <mc...@vendavo.com> on 2015/07/13 23:15:45 UTC

HDFS cleanup after certain time

Hello everyone,

I'm trying to solve the following problem. I have an Oozie workflow which produces some intermediate results and some final results on HDFS. I would like to ensure that those files are deleted after a certain time, and I would like to achieve that with just Oozie and the Hadoop ecosystem. My workflow gets a working directory as an input, so I know that all files will be created within this directory. My idea is to create a coordinator job in the first step of the workflow. This coordinator would be configured to fire exactly once after a configured period, and it would execute a very simple Oozie workflow which just removes the given working directory.
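
Roughly, I imagine the one-shot coordinator could look something like this (the names cleanup-coord and cleanup-wf, the properties, and the timestamps are just placeholders); the end time is set just after the start time so that exactly one action is ever materialized:

    <coordinator-app name="cleanup-coord" frequency="${coord:days(1)}"
                     start="2015-07-20T23:15Z" end="2015-07-20T23:16Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
      <action>
        <workflow>
          <app-path>${cleanupWfPath}</app-path>
          <configuration>
            <property>
              <name>workingDir</name>
              <value>${workingDir}</value>
            </property>
          </configuration>
        </workflow>
      </action>
    </coordinator-app>

The cleanup workflow it triggers would then need nothing more than a single fs action:

    <workflow-app name="cleanup-wf" xmlns="uri:oozie:workflow:0.5">
      <start to="delete-working-dir"/>
      <action name="delete-working-dir">
        <fs>
          <delete path="${workingDir}"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Cleanup failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
      </kill>
      <end name="end"/>
    </workflow-app>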

What do you think about this approach?

I know that there is no support for creating a coordinator from within a workflow, so I will probably have to implement that as a Java action. It also means that there will be one coordinator per workflow run. Is there any limit on how many coordinators can be active?
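
As a very rough sketch, I assume the Java action could submit the coordinator through the OozieClient API, something like this (the server URL and HDFS paths are placeholders):

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.OozieClientException;

    public class SubmitCleanupCoordinator {
        public static void main(String[] args) throws OozieClientException {
            // Oozie server URL; would come from the action configuration
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            Properties conf = oozie.createConfiguration();
            // HDFS directory containing coordinator.xml
            conf.setProperty(OozieClient.COORDINATOR_APP_PATH,
                    "hdfs:///apps/cleanup-coord");
            // Parameters referenced by the coordinator definition
            conf.setProperty("workingDir", args[0]);
            conf.setProperty("cleanupWfPath", "hdfs:///apps/cleanup-wf");

            // Submit and start the coordinator; returns its job id
            String jobId = oozie.run(conf);
            System.out.println("Submitted cleanup coordinator: " + jobId);
        }
    }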

Thank you
Martin

RE: HDFS cleanup after certain time

Posted by Jo...@thomsonreuters.com.
Hello Martin,

I am in the final testing of a similar tool. It will allow anyone to specify a path and a time range for deletion. I will let everyone know when it is available.
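
The core of it is just walking a base directory with the Hadoop FileSystem API and deleting anything older than a cutoff. A simplified sketch of that general pattern (this is not the actual tool; the class and argument names are invented):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAgeCleanup {
        public static void main(String[] args) throws IOException {
            Path basePath = new Path(args[0]);        // e.g. /user/etl/work
            long maxAgeMs = Long.parseLong(args[1]);  // max age in milliseconds

            FileSystem fs = FileSystem.get(new Configuration());
            long cutoff = System.currentTimeMillis() - maxAgeMs;

            // Delete every sub-directory whose modification time is too old
            for (FileStatus status : fs.listStatus(basePath)) {
                if (status.isDirectory() && status.getModificationTime() < cutoff) {
                    fs.delete(status.getPath(), true);  // true = recursive
                }
            }
        }
    }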

I will also be taking a look at Apache Falcon.

Thank you,


Joe

-----Original Message-----
From: Flavio Pompermaier [mailto:pompermaier@okkam.it] 
Sent: Monday, July 13, 2015 4:19 PM
To: user@oozie.apache.org
Subject: Re: HDFS cleanup after certain time

Have you ever looked at Apache Falcon?

Re: HDFS cleanup after certain time

Posted by Flavio Pompermaier <po...@okkam.it>.
Have you ever looked at Apache Falcon?
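
Falcon feeds have a built-in retention policy, so the cleanup you describe would come for free. A rough sketch of a feed with a seven-day retention (the names, dates and paths are placeholders; check the Falcon docs for the exact schema):

    <feed name="working-dir-feed" description="workflow working directories"
          xmlns="uri:falcon:feed:0.1">
      <frequency>days(1)</frequency>
      <timezone>UTC</timezone>
      <clusters>
        <cluster name="primary-cluster" type="source">
          <validity start="2015-07-13T00:00Z" end="2016-07-13T00:00Z"/>
          <!-- Falcon deletes instances older than the limit -->
          <retention limit="days(7)" action="delete"/>
        </cluster>
      </clusters>
      <locations>
        <location type="data" path="/user/etl/work/${YEAR}-${MONTH}-${DAY}"/>
      </locations>
      <ACL owner="etl" group="users" permission="0755"/>
      <schema location="/none" provider="none"/>
    </feed>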