Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/01/11 12:03:34 UTC

[GitHub] [airflow] TobKed commented on a change in pull request #13461: Add How To Guide for Dataflow

TobKed commented on a change in pull request #13461:
URL: https://github.com/apache/airflow/pull/13461#discussion_r554999776



##########
File path: docs/apache-airflow-providers-google/operators/cloud/dataflow.rst
##########
@@ -0,0 +1,274 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+Google Cloud Dataflow Operators
+===============================
+
+`Dataflow <https://cloud.google.com/dataflow/>`__ is a managed service for
+executing a wide variety of data processing patterns. These pipelines are created
+using the Apache Beam programming model which allows for both batch and streaming.
+
+.. contents::
+  :depth: 1
+  :local:
+
+Prerequisite Tasks
+^^^^^^^^^^^^^^^^^^
+
+.. include:: /operators/_partials/prerequisite_tasks.rst
+
+Ways to run a data pipeline
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+There are several ways to run a Dataflow pipeline from Airflow. To execute pipeline code directly
+from a source file (Java or Python), use the language-specific create operators.
+If a process already exists to stage the pipeline code in an abstracted manner, a templated job is the better
+choice, because the application can then be developed with minimal intrusion into the DAG that contains its operators.
+It is also possible to run jobs defined in SQL.
+
+Starting a new job
+^^^^^^^^^^^^^^^^^^
+
+To create a new pipeline from a source file (a JAR for Java or a Python file), use
+one of the create job operators:
+:class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreateJavaJobOperator`
+or
+:class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreatePythonJobOperator`.
+The source file can be located on GCS or on the local filesystem.
+
+Please see the notes below on Java and Python specific SDKs as they each have their own set
+of execution options when running pipelines.
+
+Language specific pipelines
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Depending on the language (SDK) used for the pipeline, there are specific options to be aware of.
+
+.. _howto/operator:DataflowCreateJavaJobOperator:
+
+Java SDK pipelines
+""""""""""""""""""
+
+The ``jar`` argument must be specified for
+:class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreateJavaJobOperator`
+as it contains the pipeline to be executed on Dataflow. The JAR can be stored on GCS, from which Airflow
+can download it, or on the local filesystem (provide the absolute path to it).
+
+Here is an example of creating and running a pipeline in Java with jar stored on GCS:
+
+.. exampleinclude:: /../../airflow/providers/google/cloud/example_dags/example_dataflow.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_start_java_job_jar_on_gcs]
+    :end-before: [END howto_operator_start_java_job_jar_on_gcs]
+
+
+Here is an example of creating and running a pipeline in Java with jar stored on the local filesystem:
+
+.. exampleinclude:: /../../airflow/providers/google/cloud/example_dags/example_dataflow.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_start_java_job_local_jar]
+    :end-before: [END howto_operator_start_java_job_local_jar]
+
+.. _howto/operator:DataflowCreatePythonJobOperator:
+
+Python SDK pipelines
+""""""""""""""""""""
+
+The ``py_file`` argument must be specified for
+:class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreatePythonJobOperator`
+as it contains the pipeline to be executed on Dataflow. The Python file can be stored on GCS, from which Airflow
+can download it, or on the local filesystem (provide the absolute path to it).
+
+The ``py_interpreter`` argument specifies the Python version to be used when executing the pipeline; the default
+is ``python3``. If your Airflow instance is running on Python 2, specify ``python2`` and ensure your ``py_file`` is
+in Python 2. For best results, use Python 3.
+
+If the ``py_requirements`` argument is specified, a temporary Python virtual environment with the specified requirements
+is created and the pipeline runs inside it.
+
+The ``py_system_site_packages`` argument specifies whether all the Python packages from your Airflow instance
+will be accessible within the virtual environment (if the ``py_requirements`` argument is specified).
+Avoid enabling it unless the Dataflow job requires it.
+
+.. exampleinclude:: /../../airflow/providers/google/cloud/example_dags/example_dataflow.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_start_python_job]
+    :end-before: [END howto_operator_start_python_job]
+
+
+Execution options for pipelines
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Dataflow offers several ways of executing pipelines. A pipeline can run in one of the following modes:
+batch asynchronously (fire and forget), batch blocking (wait until completion), or streaming (run indefinitely).

Review comment:
       It is based on the Dataflow documentation:
   
   https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#configuring-pipelineoptions-for-execution-on-the-cloud-dataflow-service
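
For illustration only, here is a rough sketch of how the three modes listed in the hunk above relate to each other. It is plain Python and does not call Airflow or the Dataflow SDK; `ExecutionMode`, `classify_mode` and `caller_blocks` are hypothetical names invented for this example, and the mapping reflects only the wording of the guide text, not the operators' actual implementation.

```python
from enum import Enum


class ExecutionMode(Enum):
    """The three modes named in the guide text (labels are illustrative)."""

    BATCH_ASYNC = "batch asynchronously (fire and forget)"
    BATCH_BLOCKING = "batch blocking (wait until completion)"
    STREAMING = "streaming (run indefinitely)"


def classify_mode(is_streaming: bool, wait_for_completion: bool) -> ExecutionMode:
    """Hypothetical helper: map two pipeline properties to one of the modes.

    A streaming pipeline does not finish on its own, so the wait flag is
    ignored for it in this sketch (an assumption made for the example).
    """
    if is_streaming:
        return ExecutionMode.STREAMING
    if wait_for_completion:
        return ExecutionMode.BATCH_BLOCKING
    return ExecutionMode.BATCH_ASYNC


def caller_blocks(mode: ExecutionMode) -> bool:
    """In this sketch only the blocking batch mode holds the caller until the job ends."""
    return mode is ExecutionMode.BATCH_BLOCKING
```

With this model, a batch job submitted without waiting is classified as `BATCH_ASYNC`, and `caller_blocks` returns `False` for it, which matches the "fire and forget" wording.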



