Posted to dev@falcon.apache.org by Mikhail Ilin <Mi...@nortal.com> on 2016/02/11 13:07:59 UTC

Question about handling of late data

Hello!


I have a shell script that performs certain actions on files in HDFS. The script is run through an Oozie workflow, which I want to schedule in Falcon.

The files, as usual, are located in date partitions (/some_root_dir/2016/02/03, etc.).

Every day a new directory appears and new data arrives.


The problem is that data is sometimes a few days late. I want Falcon to recognize that and, when the late data arrives, run the Oozie shell action on it as well, not only on today's portion.

But that part is insufficiently documented at the moment:


https://falcon.apache.org/FalconDocumentation.html#Handling_late_input_data

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_data_governance/content/ch_falcon_late_data_handling.html


What should be in the late workflow?

How and at which moment does Falcon decide which directories to run that late workflow on?

How are the dates (locations) of those directories passed to the late workflow?


Best regards,

Mike

Re: Question about handling of late data

Posted by Balu Vellanki Bala <bv...@hortonworks.com>.
Hi Mike,

Suppose your feed defines its late-arrival policy as
    <late-arrival cut-off="days(3)"/>

Suppose you specify the process input as

<inputs>
    <input end="now(0,0)" start="now(0,-1)" feed="raaw-logs16" name="inputData"/>
</inputs>


Now you can define the late process as

<late-process policy="periodic" delay="hours(6)">
    <late-input input="inputData" workflow-path="hdfs://inputData/late/workflow" />
</late-process>


To answer your questions:
- Make the shell script part of a shell action in the Oozie workflow.
- If you want to handle late data differently from regular data, you
should have two different workflows. Otherwise, you can reuse the
workflow that handles regular data.
- In the example above, Falcon checks periodically (policy), every six
hours (delay), for late data to arrive, until 3 days have passed (the
late-arrival cut-off defined in the feed).
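
As a rough illustration of the last point, the sketch below lists the times at which late data would be re-checked for one input instance, using the cut-off of days(3) and the delay of hours(6) from the example. It is a simplified model, not Falcon's code: the function name late_check_times is hypothetical, and it assumes that checks are counted from the instance's nominal time, which may differ from how Falcon actually schedules them.

```python
from datetime import datetime, timedelta


def late_check_times(nominal_time, cut_off, delay):
    """Return the times at which late data would be re-checked.

    Simplified model (an assumption, not Falcon's implementation):
    the first check happens one `delay` after the nominal time, and
    checks repeat every `delay` until the cut-off has passed.
    """
    checks = []
    deadline = nominal_time + cut_off
    t = nominal_time + delay
    while t <= deadline:
        checks.append(t)
        t += delay
    return checks


# Values taken from the example: cut-off="days(3)", delay="hours(6)".
checks = late_check_times(
    nominal_time=datetime(2016, 2, 3),
    cut_off=timedelta(days=3),
    delay=timedelta(hours=6),
)
print(len(checks))   # 12
print(checks[0])     # 2016-02-03 06:00:00
print(checks[-1])    # 2016-02-06 00:00:00
```

Under these assumptions, a shorter delay means late data is picked up sooner at the cost of more checks, while the cut-off bounds how long an instance is watched at all.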

The date-patterns used for input data will not change between on-time data
and late-arrival data.
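
To make that concrete, here is a minimal sketch of how a date-pattern path depends only on the instance's nominal time. The template is hypothetical, modelled on the /some_root_dir/2016/02/03 layout from the question, and resolve_path is an illustrative helper, not a Falcon API.

```python
from datetime import datetime

# Hypothetical feed location, modelled on the layout in the question.
TEMPLATE = "/some_root_dir/${YEAR}/${MONTH}/${DAY}"


def resolve_path(template, nominal_time):
    """Substitute the date variables using the instance's nominal time."""
    return (
        template
        .replace("${YEAR}", nominal_time.strftime("%Y"))
        .replace("${MONTH}", nominal_time.strftime("%m"))
        .replace("${DAY}", nominal_time.strftime("%d"))
    )


# The path is the same whether the data arrived on time or two days
# late, because only the nominal time enters the substitution.
path = resolve_path(TEMPLATE, datetime(2016, 2, 3))
print(path)  # /some_root_dir/2016/02/03
```

So a late workflow that reuses the same input name would see the same directory the regular workflow was given for that instance.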


Thank you
Balu Vellanki 
