Posted to commits@airflow.apache.org by "Aizhamal Nurmamat kyzy (JIRA)" <ji...@apache.org> on 2019/05/17 20:22:00 UTC
[jira] [Updated] (AIRFLOW-825) Add Dataflow semantics
[ https://issues.apache.org/jira/browse/AIRFLOW-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aizhamal Nurmamat kyzy updated AIRFLOW-825:
-------------------------------------------
Component/s: (was: Dataflow)
scheduler
> Add Dataflow semantics
> ----------------------
>
> Key: AIRFLOW-825
> URL: https://issues.apache.org/jira/browse/AIRFLOW-825
> Project: Apache Airflow
> Issue Type: New Feature
> Components: scheduler
> Reporter: Jeremiah Lowin
> Assignee: Jeremiah Lowin
> Priority: Major
> Fix For: 1.10.0
>
>
> Following discussion on the dev list, this adds first-class Dataflow semantics to Airflow.
> Please see my PR for examples and unit tests. From the documentation:
> A Dataflow object represents the result of an upstream task. If the upstream
> task has multiple outputs contained in a tuple, dict, or other indexable form,
> an index may be provided so the Dataflow only uses the appropriate output.
> Dataflows are passed to downstream tasks with a key. This has two effects:
> 1. It sets up a dependency between the upstream and downstream tasks to
> ensure that the downstream task does not run before the upstream result
> is available.
> 2. It ensures that the [indexed] upstream result is available in the
> downstream task's context as ``context['dataflows'][key]``. In addition,
> the result will be passed directly to PythonOperators as a keyword
> argument.
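The indexing behavior described above can be sketched in a few lines. This is a hypothetical illustration, not the PR's actual API: the `Dataflow` class, its constructor arguments, and the `resolve` method are invented names standing in for the proposed mechanism.

```python
# Hypothetical sketch of a Dataflow: it refers to an upstream task's result
# and, optionally, an index selecting one output from a tuple/dict result.
class Dataflow:
    def __init__(self, upstream_task_id, index=None):
        self.upstream_task_id = upstream_task_id
        self.index = index

    def resolve(self, result):
        # If an index was given, use only the appropriate output;
        # otherwise pass the whole upstream result through.
        return result if self.index is None else result[self.index]

df = Dataflow("extract", index="rows")
print(df.resolve({"rows": [1, 2, 3], "count": 3}))  # -> [1, 2, 3]
```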
> Dataflows use the XCom mechanism to exchange data. Data is passed through the
> following series of steps:
> 1. After the upstream task runs, data is passed to the Dataflow object's
> _set_data() method.
> 2. The Dataflow's serialize() method is called on the data. This method
> takes the data object and returns a representation that can be used to
> reconstruct it later.
> 3. _set_data() stores the serialized result as an XCom.
> 4. Before the downstream task runs, it calls the Dataflow _get_data()
> method.
> 5. _get_data() retrieves the upstream XCom.
> 6. The Dataflow's deserialize() method is called. This method takes the
> serialized representation and returns the data object.
> 7. The data object is passed to the downstream task.
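The seven steps above can be sketched as plain Python, with a dict standing in for the XCom table. All names here (`_set_data`, `_get_data`, `serialize`, `deserialize`) follow the description, but the bodies are an assumed illustration, not the PR's implementation:

```python
import json

# Stand-in for the Airflow XCom table (in reality, rows in the metadata DB).
XCOM = {}

class Dataflow:
    def __init__(self, key):
        self.key = key

    def serialize(self, data):
        # Step 2: produce a representation that can reconstruct the data later.
        return json.dumps(data)

    def deserialize(self, rep):
        # Step 6: take the serialized representation, return the data object.
        return json.loads(rep)

    def _set_data(self, data):
        # Steps 1 and 3: called after the upstream task runs; stores the
        # serialized result as an XCom.
        XCOM[self.key] = self.serialize(data)

    def _get_data(self):
        # Steps 4-6: called before the downstream task runs; retrieves the
        # XCom and deserializes it.
        return self.deserialize(XCOM[self.key])

df = Dataflow("upstream_result")
df._set_data({"rows": 42})   # upstream side
print(df._get_data())        # step 7: handed to the downstream task
```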
> The basic Dataflow object has identity serialize and deserialize methods,
> meaning data is stored directly in the Airflow database. Therefore, for
> performance and practical reasons, basic Dataflows should not be used with
> large or complex results.
> Dataflows can easily be extended to use remote storage. In this case, the
> serialize method should write the data into storage and return a URI, which
> will be stored as an XCom. The URI will be passed to deserialize() so that
> the data can be downloaded and reconstructed.
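A remote-storage extension of that kind might look like the following sketch, which uses a local temp file as a stand-in for remote storage; the class name and the use of JSON are assumptions for illustration only:

```python
import json
import os
import tempfile

class FileDataflow:
    """Hypothetical storage-backed Dataflow: serialize writes the data out
    and returns a URI, so only the small URI string is stored as an XCom."""

    def serialize(self, data):
        # Write the data to "remote" storage (here, a local temp file)
        # and return its location; this URI is what gets stored as an XCom.
        fd, path = tempfile.mkstemp(suffix=".json")
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        return path

    def deserialize(self, uri):
        # Given the URI retrieved from XCom, download and reconstruct the data.
        with open(uri) as f:
            return json.load(f)

df = FileDataflow()
uri = df.serialize({"rows": [1, 2, 3]})
print(df.deserialize(uri))  # -> {'rows': [1, 2, 3]}
```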
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)