Posted to dev@falcon.apache.org by Mahak Mukhi <mm...@yahoo-inc.com.INVALID> on 2015/06/25 18:32:10 UTC

Falcon late arrival and cut off

Hi,
I wanted to get a clearer picture of how Falcon handles late arrivals. Does it wait for the specific feed instance until the cut-off time before failing, or does it look for all files in the time interval (current - cut off) to (current)? Consider the following two scenarios; I'd like to know which one corresponds to Falcon's behaviour:
There's a feed set up for replication with a frequency of 10 minutes, and the late arrival cut-off time is set to one hour.
Scenario 1: A feed instance runs at 17:30 for replication, but a file ending in 1730 isn't available yet. So the instance is rescheduled for a later time, and this keeps happening until the file is found or the late arrival cut-off time (an hour in this case) is reached. In the latter case, the replication job fails.
Scenario 2: A feed instance runs at 17:30 for replication and finds that a file ending in 1720 is now available which wasn't available when the last replication instance ran (at 17:20). So now it copies both files (the one ending in 1730 and the one ending in 1720).

I'm inclined to believe that Scenario 1 corresponds to Falcon's behaviour; however, I want to confirm that I'm not missing anything. If it is Scenario 2, how does Falcon keep track of which files have been copied?
Your help is much appreciated. Thanks.
 Regards,
Mahak Mukhi

Falcon late arrival and cut off

Posted by Idris Ali <ps...@gmail.com>.
Hi Mahak,

To quickly answer your question:
Scenario 1 : A feed instance runs at 17:30 for replication but a file
ending in 1730 isn't available yet. So, the instance is rescheduled for a
later time and this keeps on happening until the file is found or the late
arrival cut off time (an hour in this case) is reached.
- Assuming it's a feed with frequency minutes(10), this scenario has
nothing to do with late data. When the availability flag is ready, the
replication kicks off; otherwise the 17:30 replication instance stays in
the "Waiting" state. Once the availability flag is found, the instance goes
to the "Running" state and replicates the data to the target cluster, and
the 17:30 instance is considered a "Success".

Scenario 2: A feed instance runs at 17:30 for replication and finds that a
file ending in 1720 is now available which wasn't available when the last
replication instance ran(at 17:20). So, now it copies both the files (the
one ending in 1730 and the one ending in 1720).
- No, it won't copy data from both instances. Since 17:20 is available
for the first time, the 17:20 instance simply copies 17:20's data alone,
and the feed instance for 17:30 checks for data under the 17:30 directory
alone. Both are independent instances.
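
To make the independence of instances concrete, here is a minimal Python
sketch. It is not Falcon code: the path layout and the function names
(instance_path, instance_state) are made up for illustration, and only the
"Waiting"/"Running" state names come from the explanation above.

```python
from datetime import datetime


def instance_path(instance_time: datetime) -> str:
    # Hypothetical directory layout; a real feed defines its own location.
    return instance_time.strftime("/data/feed/%Y-%m-%d-%H%M")


def instance_state(instance_time: datetime, available_paths: set) -> str:
    # An instance looks only at its own directory, never at a neighbour's.
    if instance_path(instance_time) in available_paths:
        return "Running"
    return "Waiting"


# Only the 17:20 data has arrived so far.
available = {instance_path(datetime(2015, 6, 25, 17, 20))}

print(instance_state(datetime(2015, 6, 25, 17, 20), available))  # Running
print(instance_state(datetime(2015, 6, 25, 17, 30), available))  # Waiting
```

The 17:30 instance stays in "Waiting" even though 17:20's data is present,
because it never inspects another instance's directory.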


Late arrival works for both Feed and Process, and the details of the
functionality are available in the Falcon documentation.
Please check the "Late Data" section of
http://falcon.apache.org/0.6-incubating/EntitySpecification.html#Feed_Specification


Since your question is related to Feed replication (late-data), I will try
to answer here:
1. From the Feed definition, let's say we have
<frequency>hours(1)</frequency>
<late-arrival cut-off="hours(6)"/>

2. From Falcon's runtime.properties
A feed cut-off policy is required for late-data handling for Feeds.
Allowed policies: periodic, exp-backoff (exponential backoff) and final.
Ex: periodic with delay=hours(2)

Here, Falcon would replicate the feed once every hour: 17:00, 18:00 and so
on.
late-arrival specifies *how long this feed should be checked for late data
changes in the source cluster*. In this case, 6 hours.
So the instance 17:00 is honoured until 23:00 (17 + 6), the instance 18:00
until 00:00 (next day), and so on.
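
The cut-off arithmetic above can be sketched in a few lines of Python. This
is only an illustration, not Falcon's implementation; the function name
late_cutoff is hypothetical, and the date is simply the day this thread was
posted.

```python
from datetime import datetime, timedelta


def late_cutoff(instance_time: datetime, cut_off: timedelta) -> datetime:
    # Late data is honoured until instance time plus the cut-off.
    return instance_time + cut_off


cut_off = timedelta(hours=6)  # mirrors <late-arrival cut-off="hours(6)"/>

print(late_cutoff(datetime(2015, 6, 25, 17, 0), cut_off))  # 2015-06-25 23:00:00
print(late_cutoff(datetime(2015, 6, 25, 18, 0), cut_off))  # 2015-06-26 00:00:00
```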

*When to check?* This is specified by the cut-off policy. Here it is
periodic with hours(2), so Falcon checks for changes in the source cluster
input every 2 hours.
So Falcon would check the instance 17:00 for data in the source cluster at
19:00, then at 21:00, and finally at 23:00.
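
As a rough sketch of the periodic policy described above (again not Falcon's
own code; periodic_checks is a made-up name), the check times for one
instance can be computed like this in Python:

```python
from datetime import datetime, timedelta


def periodic_checks(instance_time: datetime,
                    cut_off: timedelta,
                    delay: timedelta) -> list:
    # Check every `delay` after the instance time, up to and including
    # the end of the late-arrival window.
    window_end = instance_time + cut_off
    checks = []
    check_time = instance_time + delay
    while check_time <= window_end:
        checks.append(check_time)
        check_time += delay
    return checks


checks = periodic_checks(datetime(2015, 6, 25, 17, 0),
                         cut_off=timedelta(hours=6),
                         delay=timedelta(hours=2))
print([c.strftime("%H:%M") for c in checks])  # ['19:00', '21:00', '23:00']
```

The exp-backoff and final policies would produce a different schedule; only
the periodic case from this example is modelled here.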

*How are changes detected?* Falcon maintains the data size for every
instance run, so it records the size of the data at the first run (17:00).
If it detects a different size in the source input at the next periodic
check (19:00), it simply reruns the entire replication, *overwriting* the
previously replicated data.
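
The size comparison can be pictured with the following Python sketch. It is
an assumption-laden illustration, not Falcon's implementation: the dictionary
of recorded sizes, the byte counts and the name late_check are all
hypothetical.

```python
def late_check(recorded_sizes: dict, instance: str, current_size: int) -> bool:
    """Return True when the instance should be replicated again."""
    if recorded_sizes[instance] == current_size:
        return False  # same size as last time: nothing arrived late
    # Size differs: remember the new size and signal a full rerun,
    # which replaces the previously replicated data.
    recorded_sizes[instance] = current_size
    return True


recorded = {"17:00": 1024}  # size in bytes recorded at the first run

print(late_check(recorded, "17:00", 1024))  # False (check at 19:00, no change)
print(late_check(recorded, "17:00", 2048))  # True  (check at 21:00, data grew)
print(late_check(recorded, "17:00", 2048))  # False (check at 23:00, no change)
```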



Hope this answers your question.

Thanks,
-Idris

On Thu, Jun 25, 2015 at 10:02 PM, Mahak Mukhi <mmukhi@yahoo-inc.com.invalid> wrote:

> Hi,
> I wanted to get a clearer picture on how does falcon handle late arrivals?
> Does it wait for the specific feed instance for cut off time before failing
> or would it look for all files in the time interval (current - cut off) to
> (current).Consider the following 2 scenarios, I'd like to know which one
> corresponds with falcon:
> There's a feed set up for replication with a frequency of 10 minutes and
> the late arrival cut off time is set to be an hour.
> Scenario 1 : A feed instance runs at 17:30 for replication but a file
> ending in 1730 isn't available yet. So, the instance is rescheduled for a
> later time and this keeps on happening until the file is found or the late
> arrival cut off time (an hour in this case) is reached. In latter case, the
> replication job fails.
> Scenario 2: A feed instance runs at 17:30 for replication and finds that a
> file ending in 1720 is now available which wasn't available when the last
> replication instance ran(at 17:20). So, now it copies both the files (the
> one ending in 1730 and the one ending in 1720).
>
> I'm inclined to believe that scenario 1 corresponds with Falcon, however I
> want to confirm that I'm not missing anything.In case, it is Scenario 2,
> how does falcon keep track of what files have been copied?
> Your help is much appreciated. Thanks.
>  Regards,
> Mahak Mukhi
>