Posted to commits@drill.apache.org by dz...@apache.org on 2021/08/22 17:58:50 UTC

[drill] 01/02: Drill provider for Airflow blog post.

This is an automated email from the ASF dual-hosted git repository.

dzamo pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git

commit aa99123c5690cfacb74925df740b02f5c3b6350b
Author: James Turton <ja...@somecomputer.xyz>
AuthorDate: Thu Aug 5 16:01:44 2021 +0200

    Drill provider for Airflow blog post.
---
 .../install/047-installing-drill-on-the-cluster.md |  2 +-
 ...leased.md => 2018-03-18-drill-1.13-released.md} |  0
 .../en/2021-08-05-drill-provider-for-airflow.md    | 28 ++++++++++++++++++++++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/_docs/en/install/047-installing-drill-on-the-cluster.md b/_docs/en/install/047-installing-drill-on-the-cluster.md
index de359b0..2761af9 100644
--- a/_docs/en/install/047-installing-drill-on-the-cluster.md
+++ b/_docs/en/install/047-installing-drill-on-the-cluster.md
@@ -16,7 +16,7 @@ You install Drill on nodes in the cluster, configure a cluster ID, and add Zooke
 
 ### (Optional) Create the site directory
 
-The site directory contains your site-specific files for Drill.  Putting these in a separate directory to the Drill installation means that upgrading Drill will not clobber your configuration and custom code.  It is possible to skip this step, meaning that your configuration and custom code will live in the `$DRILL_HOME/conf` and `$DRILL_HOME/jars/3rdparty` subdirectories respectively.
+The site directory contains your site-specific files for Drill.  Putting these in a separate directory to the Drill installation means that upgrading Drill will not overwrite your configuration and custom code.  It is possible to skip this step, meaning that your configuration and custom code will live in the `$DRILL_HOME/conf` and `$DRILL_HOME/jars/3rdparty` subdirectories respectively.
 
 Create the site directory in a suitable location, e.g.
 
diff --git a/blog/_posts/en/2018-3-18-drill-1.13-released.md b/blog/_posts/en/2018-03-18-drill-1.13-released.md
similarity index 100%
rename from blog/_posts/en/2018-3-18-drill-1.13-released.md
rename to blog/_posts/en/2018-03-18-drill-1.13-released.md
diff --git a/blog/_posts/en/2021-08-05-drill-provider-for-airflow.md b/blog/_posts/en/2021-08-05-drill-provider-for-airflow.md
new file mode 100644
index 0000000..b643924
--- /dev/null
+++ b/blog/_posts/en/2021-08-05-drill-provider-for-airflow.md
@@ -0,0 +1,28 @@
+---
+layout: post
+title: "Drill provider for Airflow"
+code: drill-provider-for-airflow
+excerpt: In its provider package release this month, the Apache Airflow project added a provider for interacting with Apache Drill.  This allows data engineers and data scientists to incorporate Drill queries in their Airflow DAGs, enabling the automation of big data and data science workflows.
+
+authors: ["jturton"]
+---
+
+You're building a new report, visualisation or ML model.  Most of the data involved comes from sources well known to you, but a new source has become available, allowing your team to measure and model new variables.  Eager to get to a prototype and an early sense of what the new analytics look like, you head straight for the first order of business and start to construct a first version of the dataset upon which your final output will be based.
+
+The data sources you need to combine are immediately accessible but heterogeneous: transactional data in PostgreSQL must be combined with data from another team that uses Splunk, lookup data maintained by an operations team in an Excel spreadsheet, thousands of XML exports received from a partner and some Parquet files already in your big data environment just for good measure.
+
+Using Drill iteratively, you query and join in each data source one at a time, applying grouping, filtering and other intensive transformations as you go, finally producing a dataset with the fields and grain you need.  You store it by adding CREATE TABLE AS in front of your final SELECT, then write a few counting and summing queries against the original data sources and your transformed dataset to check that your code produces the expected outputs.
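+
+By way of illustration, a CTAS of roughly this shape persists the result.  This is a sketch only: the `pg` storage plugin, table and column names are invented for the example, while `dfs.tmp` is Drill's default writable workspace.
+
+```sql
+-- Hypothetical example: persist the prototype's final SELECT as a table.
+CREATE TABLE dfs.tmp.`my_dataset` AS
+SELECT o.region, SUM(o.amount) AS total_spend
+FROM pg.public.orders o
+GROUP BY o.region;
+```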
+
+Apart from possibly configuring some new storage plugins in the Drill web UI, you have so far not left DBeaver (or your editor of choice).  The onerous data exploration and plumbing parts of your project have flashed by in a blaze of SQL, and you move your dataset into the next tool for visualisation or modelling.  The results are good and you know that your users will immediately ask for the outputs to incorporate new data on a regular schedule.
+
+While Drill can assemble your dataset on the fly, as it did while you prototyped, doing that for the full set takes over 20 minutes, places more load on your data sources during office hours than you'd like and limits you to the history that the sources keep, in some cases only a few weeks.
+
+It's time for ETL, you concede.  In the past that meant you had to choose between keeping your working Drill SQL and scheduling it using 70s Unix tools like Cron and Bash, or recreating your Drill SQL in other tools and languages, perhaps Apache Beam or PySpark, possibly requiring multiple tools if none of them is as omnivorous as Drill.  But this time it's different...
+
+[Apache Airflow](https://airflow.apache.org) is a workflow engine built in the Python programming ecosystem that has grown into a leading choice for orchestrating big data pipelines, amongst its other applications.  Perhaps the first point to understand about Airflow in the context of ETL is that it is designed only for workflow _control_, and not for data flow.  This makes it different from some of the ETL tools you might have encountered, like Microsoft's SSIS or Pentaho's PDI, which handle both control flow and data flow.
+
+In contrast Airflow is, unless you're doing it wrong, used only to instruct other software like Spark, Beam, PostgreSQL, Bash, Celery, Scikit-learn scripts, Slack (... the list of connectors is long and varied) to kick off actions at scheduled times.  While Airflow does load its schedules from the crontab format, the comparison to cron stops there.  Airflow can resolve and execute complex job DAGs with options for clustering, parallelism, retries, backfilling and performance monitoring.
+
+The exciting news for Drill users is that [a new provider package adding support for Drill](https://pypi.org/project/apache-airflow-providers-apache-drill/) was added to Airflow this month.  This provider is based on the [sqlalchemy-drill package](https://pypi.org/project/sqlalchemy-drill/), which provides Drill connectivity for Python programs.  This means that you can add tasks which execute queries on Drill to your Airflow DAGs without any hacky intermediate shell scripts, or build new DAGs that combine Drill tasks with the many other systems that Airflow providers can reach.
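+
+As a sketch of what this can look like, the minimal DAG below schedules the earlier CTAS nightly through the new operator.  The connection ID `drill_default`, the schedule and the query itself are assumptions for the example, not prescriptions.
+
+```python
+# A minimal sketch: assumes a Drill connection named "drill_default" has
+# been configured in Airflow; the schedule and SQL are invented examples.
+from datetime import datetime
+
+from airflow import DAG
+from airflow.providers.apache.drill.operators.drill import DrillOperator
+
+with DAG(
+    dag_id="drill_nightly_dataset",
+    start_date=datetime(2021, 8, 1),
+    schedule_interval="0 2 * * *",  # crontab syntax: 02:00 every day
+    catchup=False,
+) as dag:
+    # Re-run the prototype's CTAS against Drill on a schedule.
+    rebuild_dataset = DrillOperator(
+        task_id="rebuild_dataset",
+        drill_conn_id="drill_default",
+        sql="""
+            DROP TABLE IF EXISTS dfs.tmp.`my_dataset`;
+            CREATE TABLE dfs.tmp.`my_dataset` AS
+            SELECT o.region, SUM(o.amount) AS total_spend
+            FROM pg.public.orders o
+            GROUP BY o.region
+        """,
+    )
+```
+
+Airflow then takes care of the scheduling, retries and backfills that the cron-and-Bash approach left to you.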
+
+In the coming days a basic tutorial for using Drill with Airflow will be added to this site, and this sentence will be replaced with a link.