Posted to dev@falcon.apache.org by John Smith <le...@gmail.com> on 2016/01/22 18:55:16 UTC

lifecycle - retention

Hello,

I found that Falcon supports a retention policy as part of the lifecycle. I
am wondering how it works, because it's not clear to me from reading the
documentation.

Assume I store one file (with thousands/millions of records) into HDFS and
I set the retention period to 1 year.

How is that retention period enforced on the records inside the file? Does
it mean that the scheduler executes some "flow" that reads the stored file
record by record every day and checks the current date against the
retention date? In the case current date >= retention date, the record is
removed. Is it CPU/time consuming? Does each check require a full file
scan?

What will happen in the scenario where I define different retention dates
per field?



Thank you!

Best,
John

Re: lifecycle - retention

Posted by Suresh Srinivas <su...@hortonworks.com>.
Sowmya, awesome and detailed! Thank you and you should encourage others to
do this too.

On 1/22/16, 12:20 PM, "Sowmya Ramesh" <sr...@hortonworks.com> wrote:

>Hi John,
>
>Retention policy determines how long the data will remain on the cluster.
>
>Falcon kicks off the retention policy on the basis of the time value you
>specify in the retention limit:
>
>* Less than 24 hours: Falcon kicks off the retention policy job every 6
>hours
>* More than 24 hours: Falcon kicks off the retention policy job every 24
>hours
>
>When a feed is scheduled, Falcon kicks off the retention policy
>immediately. When the job runs, it deletes everything that's eligible for
>eviction - the eligibility criterion is the date pattern on the partition
>and NOT the creation date. For example, if the retention limit is 90 days
>then the retention job consistently deletes files older than 90 days.
>
>I don't understand what you mean by records inside the file. I am
>assuming you mean files within a directory.
>
>For retention, Falcon expects data to be in dated partitions. I will try
>to explain the retention policy logic with an example.
>Let's say your feed location is defined as below:
>
><locations>
>        <location type="data"
>path="/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
>        <location type="stats" path="/none"/>
>        <location type="meta" path="/none"/>
></locations>
>
>When the retention job is kicked off, it finds all the files that need to
>be evicted based on the retention policy. For the feed example mentioned
>above:
>* It gets the location from the feed, which is
>"/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"
>* Then it uses pattern matching to find the file pattern to get the list
>of files for the feed: "/falcon/demo/primary/clicks/*-*-*-*"
>* Calls FileSystem.globStatus with the file pattern
>"/falcon/demo/primary/clicks/*-*-*-*" to get the list of files
>* Gets the date from the file path. For example, if the file path is
>/falcon/demo/primary/clicks/2016-01-11-02, the mapped date is
>2016-01-11-02T00:00Z
>* If the file path date is beyond the retention limit, it's deleted
>
>As this uses pattern matching, it is not time consuming.
>You can set retention policies on a per-cluster basis, not on a per-field
>basis.
>
>Hope this helps. Let us know if you have any further queries.
>
>Thanks!
>
>On 1/22/16, 9:55 AM, "John Smith" <le...@gmail.com> wrote:
>
>>Hello,
>>
>>I found that Falcon supports a retention policy as part of the lifecycle.
>>I am wondering how it works, because it's not clear to me from reading
>>the documentation.
>>
>>Assume I store one file (with thousands/millions of records) into HDFS
>>and I set the retention period to 1 year.
>>
>>How is that retention period enforced on the records inside the file?
>>Does it mean that the scheduler executes some "flow" that reads the
>>stored file record by record every day and checks the current date
>>against the retention date? In the case current date >= retention date,
>>the record is removed. Is it CPU/time consuming? Does each check require
>>a full file scan?
>>
>>What will happen in the scenario where I define different retention
>>dates per field?
>>
>>
>>
>>Thank you!
>>
>>Best,
>>John
>
>


Re: lifecycle - retention

Posted by Praveen Adlakha <pr...@inmobi.com>.
Hi John,

Just to add to Ajay's point about splitting the data: it can be done
easily in Pig using MultiStorage. For more details please refer to:

https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html

Thanks
Praveen

On Mon, Jan 25, 2016 at 10:33 AM, Ajay Yadav <aj...@gmail.com> wrote:

> Hi John,
>
> To avoid reading each line/record of the input we usually divide the data
> by date, e.g. all data for a day in one file. This way you can avoid
> scanning data for all dates during retention. Usually this sort of
> modelling is a good idea for general processing of the data as well, since
> consumers typically consume data for a time range. Sometimes it is not
> possible to *produce* data in such fashion and we have to write aggregator
> processes to batch the data. If it is not possible to divide the data by
> date for your use case, then there is no way to delete data for a
> particular date without reading each line/record of the input file, with
> or without Falcon.
>
>
>
> On Mon, Jan 25, 2016 at 5:03 AM, John Smith <le...@gmail.com> wrote:
>
> > OK,
> > but in general, to execute or process that kind of requirement there is
> > no other way than to read each line/record of the input file.
> >
> >
> >
> >
> > On Mon, Jan 25, 2016 at 12:23 AM, Venkat Ramachandran
> > <me...@gmail.com> wrote:
> > > It's a good idea to open a JIRA with your requirements.
> > > You can either implement a custom Pig job that reads and removes the
> > > expired rows or you can leverage the new Lifecycle feature introduced
> > > in Falcon 0.8 that allows you to provide your own plugin for the
> > > retention implementation.
> >
>


Re: lifecycle - retention

Posted by Ajay Yadav <aj...@gmail.com>.
Hi John,

To avoid reading each line/record of the input we usually divide the data
by date, e.g. all data for a day in one file. This way you can avoid
scanning data for all dates during retention. Usually this sort of
modelling is a good idea for general processing of the data as well, since
consumers typically consume data for a time range. Sometimes it is not
possible to *produce* data in such fashion and we have to write aggregator
processes to batch the data. If it is not possible to divide the data by
date for your use case, then there is no way to delete data for a
particular date without reading each line/record of the input file, with
or without Falcon.
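
If you end up writing such an aggregator yourself, Hadoop's MultipleOutputs
is the plain MapReduce analogue of the Pig MultiStorage approach mentioned
elsewhere in this thread. A rough sketch of the mapper (the tab-separated
layout and the date column index are assumptions, just for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitByDateMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context ctx) {
        out = new MultipleOutputs<>(ctx);
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        // Assume the second tab-separated column holds a yyyy-MM-dd date;
        // use it as the sub-directory so each day lands in its own path.
        String date = value.toString().split("\t")[1];
        out.write(NullWritable.get(), value, date + "/part");
    }

    @Override
    protected void cleanup(Context ctx)
            throws IOException, InterruptedException {
        out.close();
    }
}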



On Mon, Jan 25, 2016 at 5:03 AM, John Smith <le...@gmail.com> wrote:

> OK,
> but in general, to execute or process that kind of requirement there is
> no other way than to read each line/record of the input file.
>
>
>
>
> On Mon, Jan 25, 2016 at 12:23 AM, Venkat Ramachandran
> <me...@gmail.com> wrote:
> > It's a good idea to open a JIRA with your requirements.
> > You can either implement a custom Pig job that reads and removes the
> > expired rows or you can leverage the new Lifecycle feature introduced in
> > Falcon 0.8 that allows you to provide your own plugin for the retention
> > implementation.
>

Re: lifecycle - retention

Posted by John Smith <le...@gmail.com>.
OK,
but in general, to execute or process that kind of requirement there is
no other way than to read each line/record of the input file.




On Mon, Jan 25, 2016 at 12:23 AM, Venkat Ramachandran
<me...@gmail.com> wrote:
> It's a good idea to open a JIRA with your requirements.
> You can either implement a custom Pig job that reads and removes the
> expired rows or you can leverage the new Lifecycle feature introduced in
> Falcon 0.8 that allows you to provide your own plugin for the retention
> implementation.

Re: lifecycle - retention

Posted by Venkat Ramachandran <me...@gmail.com>.
It's a good idea to open a JIRA with your requirements.
You can either implement a custom Pig job that reads and removes the
expired rows or you can leverage the new Lifecycle feature introduced in
Falcon 0.8 that allows you to provide your own plugin for the retention
implementation.

Re: lifecycle - retention

Posted by John Smith <le...@gmail.com>.
Hello,

do you plan to add that support at the field level?

I can write this process using Pig, for example, right? But each
retention check will require a full scan of the input file, record by
record; I can't see any more sophisticated way to design/solve it.

Thank you

On Sun, Jan 24, 2016 at 5:14 PM, Venkat Ramachandran
<me...@gmail.com> wrote:
> Falcon data management is agnostic to the data and schema. Retiring a
> specific range of rows inside a file is not supported. However, you can
> write a custom job that reads the data, removes those older records, and
> writes it out - this process can be managed by Falcon. Yes, this will be
> a resource-intensive Hadoop job.

Re: lifecycle - retention

Posted by Venkat Ramachandran <me...@gmail.com>.
Falcon data management is agnostic to the data and schema. Retiring a
specific range of rows inside a file is not supported. However, you can
write a custom job that reads the data, removes those older records, and
writes it out - this process can be managed by Falcon. Yes, this will be a
resource-intensive Hadoop job.
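
For illustration, such a custom job could be a map-only Hadoop job along
these lines (the comma-separated layout, the business-date column, and the
365-day cutoff are assumptions, not Falcon code):

import java.io.IOException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RetentionFilterMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {
    private LocalDate cutoff;

    @Override
    protected void setup(Context ctx) {
        // Keep only records whose business date is younger than 365 days.
        cutoff = LocalDate.now().minusDays(365);
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        // Assume the second column carries an ISO yyyy-MM-dd business date.
        LocalDate recordDate =
            LocalDate.parse(fields[1], DateTimeFormatter.ISO_LOCAL_DATE);
        if (!recordDate.isBefore(cutoff)) {
            // The record survives retention; expired rows are simply dropped.
            ctx.write(NullWritable.get(), value);
        }
    }
}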

Re: lifecycle - retention

Posted by John Smith <le...@gmail.com>.
Hello there,


thank you for the reply. Is there any mechanism to keep/enforce a
retention policy at the field level per record? By field level I mean
that you set up a different business date for each field/column, and
therefore each record inside the file has to be processed. Executing this
type of scenario will be a very resource-intensive task; that's why I'm
looking for a hint or tool if there is any.


Thank you

Re: lifecycle - retention

Posted by Venkat Ramachandran <me...@gmail.com>.
Adding to Sowmya's reply:

Falcon does not look into the data records inside those files to enforce
retention. Basically, it works at the file level, taking into account the
naming scheme followed in the HDFS file paths.

On Friday, January 22, 2016, Sowmya Ramesh <sr...@hortonworks.com> wrote:

> Hi John,
>
> Retention policy determines how long the data will remain on the cluster.
>
> Falcon kicks off the retention policy on the basis of the time value you
> specify in the retention limit:
>
> * Less than 24 hours: Falcon kicks off the retention policy job every 6
> hours
> * More than 24 hours: Falcon kicks off the retention policy job every 24
> hours
>
> When a feed is scheduled, Falcon kicks off the retention policy
> immediately. When the job runs, it deletes everything that's eligible for
> eviction - the eligibility criterion is the date pattern on the partition
> and NOT the creation date. For example, if the retention limit is 90 days
> then the retention job consistently deletes files older than 90 days.
>
> I don't understand what you mean by records inside the file. I am
> assuming you mean files within a directory.
>
> For retention, Falcon expects data to be in dated partitions. I will try
> to explain the retention policy logic with an example.
> Let's say your feed location is defined as below:
>
> <locations>
>         <location type="data"
> path="/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
>         <location type="stats" path="/none"/>
>         <location type="meta" path="/none"/>
> </locations>
>
> When the retention job is kicked off, it finds all the files that need to
> be evicted based on the retention policy. For the feed example mentioned
> above:
> * It gets the location from the feed, which is
> "/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"
> * Then it uses pattern matching to find the file pattern to get the list
> of files for the feed: "/falcon/demo/primary/clicks/*-*-*-*"
> * Calls FileSystem.globStatus with the file pattern
> "/falcon/demo/primary/clicks/*-*-*-*" to get the list of files
> * Gets the date from the file path. For example, if the file path is
> /falcon/demo/primary/clicks/2016-01-11-02, the mapped date is
> 2016-01-11-02T00:00Z
> * If the file path date is beyond the retention limit, it's deleted
>
> As this uses pattern matching, it is not time consuming.
> You can set retention policies on a per-cluster basis, not on a per-field
> basis.
>
> Hope this helps. Let us know if you have any further queries.
>
> Thanks!
>
> On 1/22/16, 9:55 AM, "John Smith" <lenovomi@gmail.com>
> wrote:
>
> >Hello,
> >
> >I found that Falcon supports a retention policy as part of the
> >lifecycle. I am wondering how it works, because it's not clear to me
> >from reading the documentation.
> >
> >Assume I store one file (with thousands/millions of records) into HDFS
> >and I set the retention period to 1 year.
> >
> >How is that retention period enforced on the records inside the file?
> >Does it mean that the scheduler executes some "flow" that reads the
> >stored file record by record every day and checks the current date
> >against the retention date? In the case current date >= retention date,
> >the record is removed. Is it CPU/time consuming? Does each check require
> >a full file scan?
> >
> >What will happen in the scenario where I define different retention
> >dates per field?
> >
> >
> >
> >Thank you!
> >
> >Best,
> >John
>
>

Re: lifecycle - retention

Posted by Sowmya Ramesh <sr...@hortonworks.com>.
Hi John,

Retention policy determines how long the data will remain on the cluster.

Falcon kicks off the retention policy on the basis of the time value you
specify in the retention limit:

* Less than 24 hours: Falcon kicks off the retention policy job every 6
hours
* More than 24 hours: Falcon kicks off the retention policy job every 24
hours

When a feed is scheduled, Falcon kicks off the retention policy
immediately. When the job runs, it deletes everything that's eligible for
eviction - the eligibility criterion is the date pattern on the partition
and NOT the creation date. For example, if the retention limit is 90 days
then the retention job consistently deletes files older than 90 days.
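
For reference, that limit is declared per cluster in the feed entity; a
minimal snippet (the cluster name and validity window below are
placeholders, not from your setup):

<clusters>
    <cluster name="primary-cluster" type="source">
        <validity start="2016-01-01T00:00Z" end="2017-01-01T00:00Z"/>
        <retention limit="days(90)" action="delete"/>
    </cluster>
</clusters>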

I don't understand what you mean by records inside the file. I am
assuming you mean files within a directory.

For retention, Falcon expects data to be in dated partitions. I will try
to explain the retention policy logic with an example.
Let's say your feed location is defined as below:

<locations>
        <location type="data"
path="/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/none"/>
        <location type="meta" path="/none"/>
</locations>

When the retention job is kicked off, it finds all the files that need to
be evicted based on the retention policy. For the feed example mentioned
above:
* It gets the location from the feed, which is
"/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"
* Then it uses pattern matching to find the file pattern to get the list
of files for the feed: "/falcon/demo/primary/clicks/*-*-*-*"
* Calls FileSystem.globStatus with the file pattern
"/falcon/demo/primary/clicks/*-*-*-*" to get the list of files
* Gets the date from the file path. For example, if the file path is
/falcon/demo/primary/clicks/2016-01-11-02, the mapped date is
2016-01-11-02T00:00Z
* If the file path date is beyond the retention limit, it's deleted
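
For illustration only, here is a rough Java sketch of that flow against
the Hadoop FileSystem API. This is not Falcon's actual implementation;
the class name and the hard-coded 90-day limit are made up:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EvictionSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The feed location .../clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}
        // becomes the glob pattern below.
        FileStatus[] candidates =
            fs.globStatus(new Path("/falcon/demo/primary/clicks/*-*-*-*"));
        if (candidates == null) return;  // nothing matched the pattern
        // Map a path like .../2016-01-11-02 to 2016-01-11-02T00:00Z.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd-HH");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        long retentionMs = 90L * 24 * 60 * 60 * 1000;  // 90 days
        long cutoff = System.currentTimeMillis() - retentionMs;
        for (FileStatus stat : candidates) {
            Date pathDate = fmt.parse(stat.getPath().getName());
            // Older than the retention limit: evict.
            if (pathDate.getTime() < cutoff) {
                fs.delete(stat.getPath(), true);  // recursive delete
            }
        }
    }
}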

As this uses pattern matching, it is not time consuming.
You can set retention policies on a per-cluster basis, not on a per-field
basis.

Hope this helps. Let us know if you have any further queries.

Thanks!

On 1/22/16, 9:55 AM, "John Smith" <le...@gmail.com> wrote:

>Hello,
>
>I found that Falcon supports a retention policy as part of the lifecycle.
>I am wondering how it works, because it's not clear to me from reading
>the documentation.
>
>Assume I store one file (with thousands/millions of records) into HDFS
>and I set the retention period to 1 year.
>
>How is that retention period enforced on the records inside the file?
>Does it mean that the scheduler executes some "flow" that reads the
>stored file record by record every day and checks the current date
>against the retention date? In the case current date >= retention date,
>the record is removed. Is it CPU/time consuming? Does each check require
>a full file scan?
>
>What will happen in the scenario where I define different retention
>dates per field?
>
>
>
>Thank you!
>
>Best,
>John