Posted to issues@flink.apache.org by "Ufuk Celebi (JIRA)" <ji...@apache.org> on 2014/07/10 11:58:04 UTC

[jira] [Resolved] (FLINK-528) First release for python-language-binding

     [ https://issues.apache.org/jira/browse/FLINK-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ufuk Celebi resolved FLINK-528.
-------------------------------

    Resolution: Invalid

> First release for python-language-binding
> -----------------------------------------
>
>                 Key: FLINK-528
>                 URL: https://issues.apache.org/jira/browse/FLINK-528
>             Project: Flink
>          Issue Type: Bug
>          Components: Python API
>            Reporter: GitHub Import
>              Labels: github-import
>             Fix For: pre-apache
>
>
> Hi, 
> since the python-language-binding is moving towards a first release, I wanted to open an issue to show the current state, the syntax and the planned roadmap towards the release (plus the future roadmap), and to open it for discussion ;)
> Current State in short
> ( https://github.com/filiphaase/stratosphere/tree/langbinding )
> python-language-binding enables the user to write Stratosphere jobs in Python, including the following operators: FileInputFormat, CSVOutputFormat, Join, Cross, CoGroup, Map, Reduce (without Combiner) and Union for all of them.
> Jobs can be executed locally or on a cluster. For cluster execution, a job can be submitted via the stratosphere bash script: the Java code of the framework is used as a jar, and the user's Python script and the Python files of the language-binding framework are shipped via the configuration.
> To show you the syntax I set up a little documentation in GDocs: https://docs.google.com/document/d/1Caml9rr7irecKo32TmfM5p-ns9OUDhw4fQ5Xzb_wAS0/edit?usp=sharing
> And, as always, a WordCount example to show the current syntax:
> ```python
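> import re
>
> # TextInputFormat and ValueType come from the language binding;
> # the binding's own import line is not shown in the original post.
> inputPath1 = r"file:///home/filip/Documents/stratosphere/hamlet.txt"
> outputPath = r"file:///home/filip/Documents/stratosphere/WCresult.txt"
>
> def split(record, collector):
>     # Lowercase the line, replace runs of non-word characters with a space,
>     # and emit one (word, 1) pair per word.
>     filteredLine = re.sub(r"\W+", " ", record[0].lower())
>     for s in filteredLine.split():
>         collector.collect((s, 1))
>
> def count(records, collector):
>     # Count the records handed to this reduce call (keyed on field 0 below).
>     total = 0
>     record = None
>     for val in records:
>         record = val
>         total += 1
>     if record is not None:
>         collector.collect((record[0], total))
>
> TextInputFormat(inputPath1).map(split, [ValueType.String, ValueType.Int]) \
>     .reduce(count, [ValueType.String, ValueType.Int], keyInd = 0) \
>     .outputCSV(outputPath, [0,1], fieldDelimiter = " ", recordDelimiter = "\n") \
>     .execute()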
> ```
> Release-Roadmap
> - Add fault tolerance and debugging possibilities for users.
>     Currently stdin/stdout is used for IPC, so the user is not allowed to use print() and any debugging must be done via files. Furthermore, an error can occur in the Python process while Java waits endlessly for an answer.
>     Solution: use files for IPC (or shared memory directly, if easily implementable) and run the execution with three threads: one for execution, one for stdout (to allow the user to debug), and one for stderr and error detection.
> - Add pyInstaller to the execution process. pyInstaller can pack the Python script with all its dependencies into an executable, which lets users rely on libraries and scripts that are installed only on the master machine and not on the whole cluster.
> - Add missing functionality:
>     - ValueType.long
>     - CSVInputFormat
> - Add a pyStratosphere bash script that calls the python-lang-binding and lets the user pass command-line arguments to the Python process.
> Longterm-Roadmap (partly covered in mailing list “Contributing to the language binding”)
> - Missing functionality:
>     - Iterators
>     - Aggregators
>     - Accumulators and Counters
>     - Combinable
>     - Broadcast Variables
> - use shared memory/improved serialization/type handling for improved speed
> - "standalone" driver for the language binding to use it on "small" data on a local machine & for development
> - develop machine-learning use-case
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/528
> Created by: [filiphaase|https://github.com/filiphaase]
> Labels: 
> Created at: Mon Mar 03 17:50:24 CET 2014
> State: open
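
The IPC change proposed in the release roadmap quoted above (files for data exchange, plus separate handling of stdout and stderr so that print() works and a failed or hung Python process is detected) could look roughly like the sketch below. It is only an illustration under stated assumptions: the names run_user_script, drain and USER_SCRIPT, the temporary file names, and the timeout value are all hypothetical and not part of the language binding, and the parent side is written in Python here to keep the example self-contained, whereas in the binding that role is played by the Java framework.

```python
import os
import subprocess
import sys
import tempfile
import threading


def drain(stream, lines):
    """Collect lines from one of the child's pipes until the pipe is closed."""
    for line in stream:
        lines.append(line.rstrip("\n"))


def run_user_script(script, records, timeout=30.0):
    """Run `script` in a child interpreter, exchanging records through files.

    Hypothetical helper, not part of the binding. Returns
    (results, stdout_lines, stderr_lines) and raises RuntimeError if the child
    exits with a non-zero status or does not finish within `timeout` seconds.
    """
    with tempfile.TemporaryDirectory() as tmp:
        in_path = os.path.join(tmp, "input.txt")
        out_path = os.path.join(tmp, "output.txt")
        with open(in_path, "w") as f:
            f.write("\n".join(records))

        out_lines, err_lines = [], []
        with subprocess.Popen(
            [sys.executable, "-c", script, in_path, out_path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        ) as child:
            # One thread per pipe; the calling thread supervises execution.
            readers = [
                threading.Thread(target=drain, args=(child.stdout, out_lines)),
                threading.Thread(target=drain, args=(child.stderr, err_lines)),
            ]
            for t in readers:
                t.start()
            try:
                status = child.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                # Do not wait forever for an answer: stop the child instead.
                child.kill()
                child.wait()
                status = None
            for t in readers:
                t.join()

        if status is None:
            raise RuntimeError("worker timed out")
        if status != 0:
            raise RuntimeError("worker failed:\n" + "\n".join(err_lines))

        with open(out_path) as f:
            results = f.read().splitlines()
    return results, out_lines, err_lines


# Hypothetical user code: data travels through the two files given on the
# command line, so stdout stays free for print()-style debugging.
USER_SCRIPT = """
import sys
in_path, out_path = sys.argv[1], sys.argv[2]
with open(in_path) as src, open(out_path, "w") as dst:
    for line in src:
        print("debug: saw", line.strip())
        dst.write(line.strip().upper() + "\\n")
"""

if __name__ == "__main__":
    results, out, err = run_user_script(USER_SCRIPT, ["to be", "or not"])
    print(results)
    print(out)
```

Because results no longer share a channel with the user's output, a stray print() cannot corrupt the record stream, and the timeout together with the exit-status check turns a silent hang or crash into an explicit error carrying the child's stderr text.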



--
This message was sent by Atlassian JIRA
(v6.2#6252)