Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/12/02 16:02:11 UTC

[GitHub] [airflow] TobKed commented on a change in pull request #8809: [AIRFLOW-6294] Create guide for Dataflow operators

TobKed commented on a change in pull request #8809:
URL: https://github.com/apache/airflow/pull/8809#discussion_r534284435



##########
File path: docs/howto/operator/gcp/dataflow.rst
##########
@@ -0,0 +1,180 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+Google Cloud Dataflow Operators
+===============================
+
+`Dataflow <https://cloud.google.com/dataflow/>`__ is a managed service for
+executing a wide variety of data processing patterns. These pipelines are created
+using the Apache Beam programming model which allows for both batch and streaming.
+
+.. contents::
+  :depth: 1
+  :local:
+
+Prerequisite Tasks
+^^^^^^^^^^^^^^^^^^
+
+.. include:: _partials/prerequisite_tasks.rst
+
+Ways to run a data pipeline
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+There are multiple ways to execute a Dataflow pipeline from Airflow. To execute the pipeline
+code from a source file (Java or Python), it is best to use the language-specific create operators.
+If a process already exists to stage the pipeline code separately, a templated job is the better choice, as
+it allows the pipeline to be developed with minimal intrusion into the DAG that contains its operators.
+
+.. _howto/operator:DataflowCreateJavaJobOperator:
+.. _howto/operator:DataflowCreatePythonJobOperator:
+
+Starting a new job
+""""""""""""""""""
+
+To create a new pipeline from a source file (a JAR for Java or a Python file), use
+one of the create job operators:
+:class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreateJavaJobOperator`
+or
+:class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreatePythonJobOperator`.
+The source file can be located on GCS or on the local filesystem.
+
+Please see the notes below on the Java and Python SDKs, as each has its own set
+of execution options when running pipelines.
+
+Here is an example of creating and running a pipeline in Java:
+
+.. exampleinclude:: ../../../../airflow/providers/google/cloud/example_dags/example_dataflow.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_start_java_job]
+    :end-before: [END howto_operator_start_java_job]
+
+.. _howto/operator:DataflowTemplatedJobStartOperator:
+
+Templated jobs
+""""""""""""""
+
+Templates give the ability to stage a pipeline on Cloud Storage and run it from there. This
+provides flexibility in the development workflow as it separates the development of a pipeline
+from the staging and execution steps. To start a templated job, use the
+:class:`~airflow.providers.google.cloud.operators.dataflow.DataflowTemplatedJobStartOperator` operator.
+
+.. exampleinclude:: ../../../../airflow/providers/google/cloud/example_dags/example_dataflow.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_start_template_job]
+    :end-before: [END howto_operator_start_template_job]
+
+See the `list of Google-provided templates that can be used with this operator
+<https://cloud.google.com/dataflow/docs/guides/templates/provided-templates>`_.
+
+Execution options for pipelines
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Dataflow can execute pipelines in the following modes:
+asynchronously (fire and forget), blocking (wait until completion), or streaming (run indefinitely).
+In Airflow it is best to use asynchronous pipelines, because blocking ones tie up Airflow resources while
+waiting for the job to complete.
+
+Asynchronous execution
+""""""""""""""""""""""
+
+Dataflow jobs are asynchronous by default; however, this depends on how the application code (contained in the JAR
+or Python file) is written. For the Dataflow job to execute asynchronously, ensure that the
+pipeline result is not being waited upon (do not call ``waitUntilFinish`` or ``wait_until_finish`` on the
+``PipelineResult`` in your application code).
+
+This is the recommended way to execute your pipelines when using Airflow.
+
+Use the Dataflow monitoring interface or the command-line interface to view the details of your pipeline's results
+after Airflow runs the operator.
+
+Blocking execution

Review comment:
       Asynchronous execution was added in https://github.com/apache/airflow/pull/11726
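
For readers of the diff above, here is a rough sketch of what "not waiting on the ``PipelineResult``" could look like in a Python pipeline file. It is not taken from the PR: the function names `parse_args` and `run` and the `--block` flag are hypothetical, chosen only for illustration. The Beam calls used (`beam.Pipeline`, `PipelineOptions`, `pipeline.run()`, `wait_until_finish()`) are the standard Apache Beam Python SDK ones, and `apache_beam` must be installed for `run` to work.

```python
import argparse


def parse_args(argv=None):
    """Split our own (hypothetical) --block flag from the Beam pipeline options."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--block",
        action="store_true",
        help="Wait for the job to finish instead of returning after submission.",
    )
    # Unrecognised arguments (e.g. --runner, --project) are passed on to Beam.
    known, pipeline_args = parser.parse_known_args(argv)
    return known, pipeline_args


def run(argv=None):
    """Build and submit a toy pipeline; block only when --block is given."""
    # Imported here so that parse_args stays usable without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    known, pipeline_args = parse_args(argv)
    pipeline = beam.Pipeline(options=PipelineOptions(pipeline_args))
    _ = (
        pipeline
        | "Create" >> beam.Create(["a", "b"])
        | "Upper" >> beam.Map(str.upper)
    )

    # run() submits the job and returns a PipelineResult.
    result = pipeline.run()
    if known.block:
        # Blocking mode: the calling process waits until the job completes.
        result.wait_until_finish()
    # Asynchronous mode: return without waiting on the result.
    return result
```

A real pipeline file would call `run()` under an `if __name__ == "__main__":` guard. Leaving out `--block` corresponds to the asynchronous mode the guide recommends for Airflow.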



