Posted to dev@airflow.apache.org by Kristian Jones <kr...@googlemail.com> on 2018/04/06 23:36:02 UTC

Dynamically trigger backfill as a result of discovery DAG

Hi

Hopefully this makes enough sense to people reading.

We are using Airflow to build a DWH from many source systems, but I have a
couple of edge cases that I’m not quite sure how to handle.


Each table from a source system is supplied as a CSV. With each daily set of
files delivered we receive a manifest/description file listing the files that
have been sent. The number of files delivered can vary, as supplying systems
become unavailable for reporting periods (connectivity or some other outage),
or because no “transactions”/events occurred. After an outage-style event, a
CSV can contain data covering “transactions”/events for a date range rather
than a single day (not the norm, but possible).
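
To make the discovery step concrete, here is a minimal sketch of reading a
daily manifest; the format shown (plain text, one delivered CSV file name per
line) is made up for illustration, ours is a bit richer:

# Minimal sketch of the discovery step. The manifest format here
# (plain text, one delivered CSV file name per line) is hypothetical.
def read_manifest(manifest_path):
    """Return the list of CSV files declared in a daily manifest."""
    with open(manifest_path) as fh:
        return [line.strip() for line in fh if line.strip()]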


Additionally, correction extracts may be supplied that would traditionally
result in an update, but some of our technology choices make updates
difficult, so ideally we are looking to follow the patterns described here:

https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a


We have some code to re-partition each extract by event/transaction date
before it can be consumed by any follow-on workflows/DAGs.
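
For concreteness, that step currently looks roughly like this (the paths and
the "event_date" column name are illustrative, not our real schema):

# Rough sketch of the re-partition step. Paths and the "event_date"
# column are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition_extract").getOrCreate()

# Read one day's delivered extract for a single source table.
extract = spark.read.csv("/landing/source_a/2018-04-06/*.csv", header=True)

# Write one output partition per event/transaction date, so a single
# delivered CSV covering a date range fans out into several daily partitions.
(extract
 .write
 .partitionBy("event_date")
 .mode("overwrite")
 .parquet("/staging/source_a/"))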


I think what I’m trying to do is trigger a follow-on DAG/subDAG for each
daily partition generated (until we've inspected the data, the date range of
snapshots/partitions to be generated isn't known). This seems analogous to
kicking off a backfill, but I’m not sure how/if it can be done as part of the
actual pipeline. We could put this logic into our PySpark scripts, but it
seems we would lose the nice features of the UI describing each stage of the
ETL pipeline, the MI about each task, failure detection, etc.
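
One idea I've been toying with is a final task in the discovery DAG that
fires one DagRun of a downstream per-day DAG for every event date found
during re-partitioning, something like the sketch below. The dag ids, the
hard-coded dates and the conf payload are all made up for illustration, and
I'm not sure this is an intended use of the experimental trigger_dag API:

# Sketch only: fan out one DagRun of a downstream per-day DAG for each
# event date discovered while re-partitioning. The dag ids, dates and
# conf payload are invented for illustration.
import json
from datetime import datetime

from airflow import DAG
from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.python_operator import PythonOperator

discovery_dag = DAG(
    dag_id="discover_and_repartition",
    schedule_interval="@daily",
    start_date=datetime(2018, 4, 1),
)


def fan_out(**context):
    # In reality the re-partition task would push the list of dates it
    # produced (e.g. via XCom); hard-coded here to keep the sketch short.
    discovered_dates = ["2018-04-04", "2018-04-05", "2018-04-06"]
    for ds in discovered_dates:
        trigger_dag(
            dag_id="load_daily_partition",
            run_id="manifest_{}_partition_{}".format(context["ds"], ds),
            conf=json.dumps({"partition_date": ds}),
            execution_date=datetime.strptime(ds, "%Y-%m-%d"),
        )


trigger_daily_loads = PythonOperator(
    task_id="trigger_daily_loads",
    python_callable=fan_out,
    provide_context=True,
    dag=discovery_dag,
)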



I suppose what I'm asking is whether there is an Airflow-idiomatic way of
doing this. We would like to retain the ability to perform a backfill and to
know that we get deterministic behaviour given a set of inputs (a manifest
file + the set of source files).
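
To illustrate what I mean by deterministic: the downstream per-day load would
ideally be a pure function of its date and the staged partition, so
re-running or backfilling it simply overwrites the same output partition.
Roughly (paths invented for illustration):

# Sketch of the downstream per-day load being idempotent: it reads only the
# staged partition for its own date and overwrites the matching warehouse
# partition. Paths are invented for illustration.
from pyspark.sql import SparkSession


def load_partition(partition_date):
    spark = SparkSession.builder.appName("load_daily_partition").getOrCreate()
    staged = spark.read.parquet(
        "/staging/source_a/event_date={}".format(partition_date))
    (staged
     .write
     .mode("overwrite")
     .parquet("/warehouse/source_a/event_date={}".format(partition_date)))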



Thanks in advance.