Posted to commits@airflow.apache.org by "Aitor Arjona (Jira)" <ji...@apache.org> on 2019/11/27 20:59:00 UTC

[jira] [Created] (AIRFLOW-6093) Incorporate IBM-PyWren Plugin to contrib operators

Aitor Arjona created AIRFLOW-6093:
-------------------------------------

             Summary: Incorporate IBM-PyWren Plugin to contrib operators
                 Key: AIRFLOW-6093
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6093
             Project: Apache Airflow
          Issue Type: New Feature
          Components: contrib, hooks, operators
    Affects Versions: 1.10.6
            Reporter: Aitor Arjona
            Assignee: Aitor Arjona


The aim of this issue is to add IBM-PyWren plugin operators to the default Airflow contrib/operators. This plugin can be found [here|https://github.com/cloudbutton/ibm-pywren_airflow-plugin].
 * *What is IBM-PyWren?*

[PyWren|https://github.com/pywren/pywren] is an open source project whose goal is to massively scale the execution of Python code and its dependencies on serverless computing platforms, and to monitor the results. PyWren delivers the user's code into the serverless platform in a transparent way, without requiring knowledge of how functions are deployed, invoked and run.
[IBM-PyWren|https://github.com/pywren/pywren-ibm-cloud] is based on PyWren's main branch and adapted for IBM Cloud Functions and IBM Cloud Object Storage. IBM-PyWren is not, however, a mere reimplementation of PyWren's API atop IBM Cloud Functions. Rather, it must be viewed as an advanced extension of PyWren that runs broader Map-Reduce jobs, based on Docker images. In extending PyWren to work with IBM Cloud Object Storage, a partition discovery component was also added that allows PyWren to process large amounts of data stored in IBM Cloud Object Storage.
 * *What does IBM-PyWren do, and what does the Airflow plugin have to offer?*

IBM-PyWren provides a clean and easy-to-use interface that implements the typical Map and Map-Reduce programming models, but executes every map function as a serverless function running on a serverless platform. This allows us to run embarrassingly parallel workloads, like some big data applications, for which this approach has already proven to be very efficient and useful.

The following snippet portrays the simplicity of using IBM-PyWren:

 
{code:python}
import pywren_ibm_cloud as pywren

# Map function: invoked once per element of iterdata
def my_map_function(x):
    return x + 7

iterdata = [1, 2, 3, 4]
pw = pywren.ibm_cf_executor()
pw.map(my_map_function, iterdata)
result = pw.get_result()   # result = [8, 9, 10, 11]
{code}
IBM-PyWren map and map-reduce jobs are asynchronous calls, that is, user code can continue executing until {{get_result()}} is called, at which point execution blocks until the map results are ready.
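
To illustrate this non-blocking pattern, here is a minimal sketch reusing the same API as the snippet above:

{code:python}
import pywren_ibm_cloud as pywren

def my_map_function(x):
    return x + 7

pw = pywren.ibm_cf_executor()
pw.map(my_map_function, [1, 2, 3, 4])   # returns immediately; functions run remotely

# ... unrelated local work can continue here while the maps execute ...

result = pw.get_result()   # blocks until all map results are ready
{code}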

An Airflow plugin has already been implemented that adds the following operators and hooks; a usage sketch follows the list:

- {{IbmPyWrenCallAsyncOperator(callable, data)}}: Invokes a single function.

- {{IbmPyWrenMapOperator(map_function, map_iterdata)}}: Invokes multiple parallel tasks, one for each element in the parameter {{map_iterdata}}. It applies the function {{map_function}} to every element in {{map_iterdata}}.

- {{IbmPyWrenMapReduceOperator(map_function, map_iterdata, reduce_function)}}: Invokes multiple parallel tasks, one for each element in the parameter {{map_iterdata}}, applying the function {{map_function}} to every element in {{map_iterdata}}. Finally, it invokes a {{reduce_function}} that gathers all the map results.

- {{IbmPyWrenHook()}}: Provides a PyWren executor ready to use.
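
As a rough illustration, the sketch below shows how one of these operators might appear in a DAG. The import path and the argument values are assumptions for illustration only; see the plugin repository for the actual interface:

{code:python}
from datetime import datetime

from airflow import DAG
# Hypothetical import path; check the plugin repository for the real module name
from ibm_pywren_airflow_plugin import IbmPyWrenMapReduceOperator

def add_seven(x):
    return x + 7

def sum_results(results):
    # Gathers all map results into a single value
    return sum(results)

dag = DAG('pywren_map_reduce_example',
          start_date=datetime(2019, 11, 1),
          schedule_interval=None)

map_reduce = IbmPyWrenMapReduceOperator(
    task_id='map_reduce_numbers',
    map_function=add_seven,
    map_iterdata=[1, 2, 3, 4],
    reduce_function=sum_results,
    dag=dag,
)
{code}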

 
 * *But Airflow already has operators to work with serverless platforms*

However, using IBM-PyWren as a client and runtime to work with serverless functions has some advantages that other clients don't have:

- IBM-PyWren is specifically designed and optimized to run Map and Map-Reduce jobs, oriented towards data analytics use cases.

- IBM-PyWren automatically pickles already-written sequential user code together with its dependencies and massively parallelizes its execution, running it on thousands of cores across a variety of serverless platforms, even in different regions, including Knative, IBM Cloud Functions, AWS Lambda, GCP Functions and Azure Functions.

- The IBM-PyWren executor efficiently manages the invocation of 1000+ functions and monitors their execution by wrapping the user code and handling exceptions when a function raises one, runs out of memory or times out. These errors would not be handled in regular serverless function execution runtimes.

- IBM-PyWren implements a data partitioner and discovery functionality that automatically splits an object stored in IBM Cloud Object Storage (COS) into smaller fixed-size chunks, so that it is easier to iterate over the data (see the sketch after this list).

- IBM-PyWren eases the way user code communicates with object storage to get and put data or to pass data between functions, and the way it sends and receives events through RabbitMQ queues, by providing ready-to-use clients that are already authenticated.

- IBM-PyWren implements its calls following the 'future' (or 'promise') asynchronous programming model. This allows us to execute map and map-reduce jobs asynchronously and consult the result later in the code.
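
As a rough sketch of the data partitioner, the following shows the general shape of processing a COS object in chunks. The {{obj}} parameter, the {{cos://}} URL form and the {{chunk_size}} argument are assumptions based on the style of the project's documentation at the time, not a verified API:

{code:python}
import pywren_ibm_cloud as pywren

# 'obj' is assumed to be a handle to the chunk assigned to this invocation
def count_lines(obj):
    data = obj.data_stream.read()
    return len(data.splitlines())

pw = pywren.ibm_cf_executor()
# Assumption: the partitioner splits the object into ~4 MiB chunks,
# invoking one function per chunk
pw.map(count_lines, 'cos://my-bucket/my-dataset.csv', chunk_size=4*1024**2)
total_lines = sum(pw.get_result())
{code}
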
 * *Why IBM-PyWren + Airflow?*

Currently, Airflow doesn't fully take advantage of the serverless cloud computing execution model. In order to scale out Airflow, we need to use the Kubernetes executor or the Celery executor; either way, we have to provision the infrastructure, which will probably sit idle most of the time. Even then, if we want to process 1000 or more parallel tasks, we would need a lot of horsepower in our cluster, skyrocketing the maintenance cost, or else be content with a slow data-processing pipeline. IBM-PyWren brings massive scaling and parallelization of tasks to even the most humble Airflow cluster, in addition to only paying for the resources we actually use.


