Posted to dev@flink.apache.org by "Shaoxuan Wang (JIRA)" <ji...@apache.org> on 2019/05/10 02:24:00 UTC
[jira] [Created] (FLINK-12470) FLIP39: Flink ML pipeline and ML libs

Shaoxuan Wang created FLINK-12470:
-------------------------------------

             Summary: FLIP39: Flink ML pipeline and ML libs
                 Key: FLINK-12470
                 URL: https://issues.apache.org/jira/browse/FLINK-12470
             Project: Flink
          Issue Type: New Feature
          Components: Library / Machine Learning
    Affects Versions: 1.9.0
            Reporter: Shaoxuan Wang
            Assignee: Shaoxuan Wang
             Fix For: 1.9.0


This is the umbrella Jira for FLIP39, which intends to enhance the scalability and ease of use of Flink ML.

ML Discussion thread: [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html]

Google Doc (to be converted to an official Confluence page very soon): [https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo/edit]

In machine learning, there are mainly two types of people involved. The first type is MLlib developers, who need a set of standard, well-abstracted core ML APIs to implement algorithms; every ML algorithm is a concrete implementation on top of these APIs. The second type is MLlib users, who use the existing, packaged MLlib to train or serve a model. It is common for an entire training or inference job to be built from a sequence of transformations or algorithms, so it is essential to provide a workflow/pipeline API that lets MLlib users easily combine multiple algorithms to describe an ML workflow/pipeline.

Flink currently has a set of ML core interfaces, but they are built on top of the DataSet API. This does not align well with the latest Flink [roadmap|https://flink.apache.org/roadmap.html] (the Table API will become the first-class citizen and primary API for analytics use cases, while the DataSet API will be gradually deprecated). Moreover, Flink at present has no interface that allows MLlib users to describe an ML workflow/pipeline, nor does it provide any way to persist a pipeline or model for later reuse. To address these issues, in this FLIP we propose to:
 * Provide a new set of ML core interfaces (on top of the Flink Table API)
 * Provide an ML pipeline interface (on top of the Flink Table API)
 * Provide interfaces for parameter management and pipeline persistence
 * Ensure that all of the above interfaces can accommodate any new ML algorithm. We will gradually add standard ML algorithms on top of the newly proposed interfaces to verify their feasibility and scalability.
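
To make the pipeline idea above more concrete, here is a minimal, self-contained sketch of how stages might be chained. It is only an illustration under stated assumptions and not the API proposed in the FLIP: the names Stage, Transformer, Estimator, Pipeline, Scaler and MeanCenterer are hypothetical, and a plain List<Double> stands in for a Flink Table so that the example runs without Flink on the classpath. Parameter management and persistence are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these types are Flink classes.
// A List<Double> (one numeric column) stands in for a Flink Table.
public class PipelineSketch {

    /** Marker for anything that can be placed in a pipeline. */
    interface Stage {}

    /** Maps one "table" to another, e.g. feature scaling or a fitted model. */
    interface Transformer extends Stage {
        List<Double> transform(List<Double> input);
    }

    /** Learns from a "table" and returns the fitted Transformer. */
    interface Estimator extends Stage {
        Transformer fit(List<Double> input);
    }

    /** Chains stages; fitting a pipeline yields one combined Transformer. */
    static final class Pipeline implements Estimator {
        private final List<Stage> stages = new ArrayList<>();

        Pipeline add(Stage stage) {
            stages.add(stage);
            return this;
        }

        @Override
        public Transformer fit(List<Double> input) {
            List<Transformer> fitted = new ArrayList<>();
            List<Double> current = input;
            for (Stage stage : stages) {
                // Estimators are fitted on the data as transformed so far;
                // every other stage is assumed to be a Transformer already.
                Transformer t = (stage instanceof Estimator)
                        ? ((Estimator) stage).fit(current)
                        : (Transformer) stage;
                fitted.add(t);
                current = t.transform(current);
            }
            // The fitted pipeline applies every fitted stage in order.
            return data -> {
                List<Double> out = data;
                for (Transformer t : fitted) {
                    out = t.transform(out);
                }
                return out;
            };
        }
    }

    /** Toy transformer: multiplies every value by a fixed factor. */
    static final class Scaler implements Transformer {
        private final double factor;

        Scaler(double factor) {
            this.factor = factor;
        }

        @Override
        public List<Double> transform(List<Double> input) {
            return input.stream().map(v -> v * factor).collect(Collectors.toList());
        }
    }

    /** Toy estimator: learns the mean and returns a transformer that subtracts it. */
    static final class MeanCenterer implements Estimator {
        @Override
        public Transformer fit(List<Double> input) {
            double mean = input.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            return data -> data.stream().map(v -> v - mean).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        Transformer model = new Pipeline()
                .add(new Scaler(2.0))
                .add(new MeanCenterer())
                .fit(Arrays.asList(1.0, 2.0, 3.0));
        // Training data scales to [2, 4, 6], so the learned mean is 4.
        System.out.println(model.transform(Arrays.asList(4.0))); // prints [4.0]
    }
}
```

In this sketch the fitted pipeline is itself a Transformer, so training and serving share one code path; whether the actual FLIP39 interfaces take this shape is not stated in this issue.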


