Posted to dev@airflow.apache.org by Rafael Barbosa <rr...@gmail.com> on 2018/07/24 10:08:26 UTC

Initial load best practices

I am creating a number of small ETL pipelines to pull data into a Redshift
instance. For each of them, I want a one-time bootstrapping step that creates
the target table (and maybe even the schema) and performs the initial load.

Preferably, I would like the DAG to perform this initial load itself, instead
of doing it manually. This would be particularly useful when testing the DAG
in different environments (e.g., local vs. production).

I see two ways of doing this:

1. Have a bootstrap operator in the DAG that runs `CREATE TABLE IF NOT
EXISTS` for the table, and rely on the DAG start date to backfill the data
in the initial load. E.g. `bootstrap` -> `fetch` -> `load`. (A rough sketch
of this follows after the list.)
2. Implement an operator that performs the initial load if the table does
not exist, or an incremental load if it does. The start date for the initial
load is a parameter of the operator. Similar to what is described under
"parameterize workflow" in this post:
<https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-ii-47c4e7cbda71#a102>.
(Also sketched below.)

How do you typically handle this situation? What are the best practices for
an initial load?

Best,
Rafael