Posted to dev@flink.apache.org by Weihua Jiang <we...@gmail.com> on 2018/12/05 03:59:18 UTC

Re: [DISCUSS] Embracing Table API in Flink ML

 It has been a while, and I think we can move forward to a JIRA discussion.
I will try to split the design into smaller pieces to make it more
understandable.

Actually, I have already implemented an initial version and ported some
flink.ml algorithms to this new API. Thus, we have a better basis for the
design discussion.

Thanks
Weihua

Chen Qin <qi...@gmail.com> wrote on Wed, Nov 21, 2018, 1:36 PM:

> Hi Yun,
>
> Very excited to see Flink ML moving forward! Your document touches on many
> points, and I couldn't agree more about the value a (unified) Table API
> could bring to the Flink ecosystem for running ML workloads. Most ML
> pipelines we have observed start from single-box Python scripts or ad hoc
> tools that researchers run to train models on a powerful machine. When that
> proves successful, they need to hook up with a data warehouse and extract
> features (this is where SQL kicks in). In the training phase, the landscape
> is very segmented: small to medium-sized models can be trained on the JVM,
> while large/deep models need the per-iteration random data shuffle (for
> SGD-based DL) optimized inside the operator, and often end up relying on
> JNI/C++/CUDA and on gang scheduling of tasks instead of hacks around
> map-reduce.
>
> Hope it makes sense. BTW, XGBoost (the most popular ML competition
> framework) has very primitive Flink support; it might be worth checking out.
> https://github.com/dmlc/xgboost
>
> Chen
>
> On Tue, Nov 20, 2018 at 6:13 PM Weihua Jiang <we...@gmail.com>
> wrote:
>
> > Hi Yun,
> >
> > Can't wait to see your design.
> >
> > Thanks
> > Weihua
> >
> > Yun Gao <yu...@aliyun.com.invalid> wrote on Wed, Nov 21, 2018, 12:43 AM:
> >
> > > Hi Weihua,
> > >
> > >     Thanks for the exciting proposal!
> > >
> > >     I have quickly read through it, and I really appreciate the idea of
> > > providing an ML Pipeline API similar to the commonly used scikit-learn
> > > library, since it greatly reduces the learning cost for AI engineers
> > > moving to the Flink platform.
> > >
> > >     Currently we are also working on a related issue, namely enhancing
> > > the stream iteration of Flink to support both SGD and online learning;
> > > it also supports batch training as a special case. We have a rough
> > > design and will start a new discussion in the next few days. I think the
> > > enhanced stream iteration will help to implement Estimators directly in
> > > Flink, and it may help to simplify the online learning pipeline by
> > > eliminating the requirement to load models from external file systems.
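> > >
> > >     (To make the feedback-loop idea concrete: below is a minimal Java
> > > sketch using today's DataStream iteration API, which the enhanced stream
> > > iteration described above would extend. The toy "gradient" values and the
> > > convergence threshold are made up purely for illustration.)
> > >
> > >     import org.apache.flink.api.common.typeinfo.Types;
> > >     import org.apache.flink.streaming.api.datastream.DataStream;
> > >     import org.apache.flink.streaming.api.datastream.IterativeStream;
> > >     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> > >
> > >     public class IterationSketch {
> > >         public static void main(String[] args) throws Exception {
> > >             StreamExecutionEnvironment env =
> > >                 StreamExecutionEnvironment.getExecutionEnvironment();
> > >
> > >             // Toy "gradient" stream; a real job would carry (features, label) records.
> > >             DataStream<Double> gradients = env.fromElements(1.0, 0.8, 0.5);
> > >
> > >             // Open a feedback edge; records sent to closeWith() re-enter the loop.
> > >             IterativeStream<Double> loop = gradients.iterate(5000L);
> > >
> > >             // One "update step": shrink the value, feed it back while it is still large.
> > >             DataStream<Double> stepped = loop.map(g -> g * 0.5).returns(Types.DOUBLE);
> > >             loop.closeWith(stepped.filter(g -> g > 0.01));
> > >
> > >             // Converged values leave the loop and could update a served model.
> > >             stepped.filter(g -> g <= 0.01).print();
> > >
> > >             env.execute("stream iteration sketch");
> > >         }
> > >     }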
> > >
> > >     I will read the design doc more carefully. Thanks again for sharing
> > > the design doc!
> > >
> > > Yours sincerely
> > >     Yun Gao
> > >
> > >
> > > ------------------------------------------------------------------
> > > From: Weihua Jiang <we...@gmail.com>
> > > Sent: Tuesday, November 20, 2018, 20:53
> > > To: dev <de...@flink.apache.org>
> > > Subject: [DISCUSS] Embracing Table API in Flink ML
> > >
> > > ML Pipeline is an idea brought by Scikit-learn
> > > <https://arxiv.org/abs/1309.0238>. Both Spark and Flink have borrowed
> > > this idea and made their own implementations [Spark ML Pipeline
> > > <https://spark.apache.org/docs/latest/ml-pipeline.html>, Flink ML Pipeline
> > > <https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/libs/ml/pipelines.html>].
> > >
> > >
> > >
> > > NOTE: though I am using the term "ML", ML Pipeline applies to both ML
> > > and DL pipelines.
> > >
> > >
> > > ML Pipeline is quite helpful for model composition (i.e. using model(s)
> > > for feature engineering). It also enables logic reuse across the train
> > > and inference phases (via pipeline persistence and load), which is
> > > essential for AI engineering. ML Pipeline can also be a good base for a
> > > Flink-based AI engineering platform if we give it good tooling support
> > > (e.g. human-readable metadata).
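> > >
> > > As an illustration only (this is not the API proposed in the design doc;
> > > the Transformer/Estimator/Pipeline names and signatures below are
> > > hypothetical), a Scikit-learn style pipeline on top of Table could be
> > > sketched in Java roughly as follows:
> > >
> > >     import java.util.ArrayList;
> > >     import java.util.List;
> > >     import org.apache.flink.table.api.Table;
> > >
> > >     // Hypothetical interfaces, shown only to make the discussion concrete.
> > >     interface Transformer {
> > >         Table transform(Table input);      // e.g. appends a prediction column
> > >     }
> > >
> > >     interface Estimator {
> > >         Transformer fit(Table training);   // training yields a Transformer (the model)
> > >     }
> > >
> > >     // A pipeline is an ordered list of stages; fitting chains fit/transform.
> > >     class Pipeline implements Estimator {
> > >         private final List<Estimator> stages = new ArrayList<>();
> > >
> > >         Pipeline add(Estimator stage) { stages.add(stage); return this; }
> > >
> > >         @Override
> > >         public Transformer fit(Table training) {
> > >             List<Transformer> fitted = new ArrayList<>();
> > >             Table current = training;
> > >             for (Estimator stage : stages) {
> > >                 Transformer model = stage.fit(current);  // fit on upstream output
> > >                 current = model.transform(current);      // feed data downstream
> > >                 fitted.add(model);
> > >             }
> > >             return input -> {                            // fitted pipeline is itself a Transformer
> > >                 Table out = input;
> > >                 for (Transformer model : fitted) {
> > >                     out = model.transform(out);
> > >                 }
> > >                 return out;
> > >             };
> > >         }
> > >     }
> > >
> > > Persisting and loading such a fitted pipeline (the train/inference reuse
> > > mentioned above) then amounts to serializing the list of fitted stages
> > > together with their metadata.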
> > >
> > >
> > > As the Table API will be the unified high-level API for both stream and
> > > batch processing, I want to initiate the design discussion of a new
> > > Table-based Flink ML Pipeline.
> > >
> > >
> > > I drafted a design document [1] for this discussion. This design tries to
> > > create a new ML Pipeline implementation so that concrete ML/DL algorithms
> > > can fit into this new API to achieve interoperability.
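> > >
> > > To give a feel for "fitting a concrete algorithm to the new API", here is
> > > one hypothetical stage written against the illustrative interfaces in the
> > > sketch above; the constant scaling factor stands in for a statistic a real
> > > Estimator would compute from the training Table:
> > >
> > >     import org.apache.flink.table.api.Table;
> > >
> > >     // Hypothetical scaling stage (illustration only, not the proposed API).
> > >     class FixedScaler implements Estimator {
> > >         private final String column;
> > >
> > >         FixedScaler(String column) { this.column = column; }
> > >
> > >         @Override
> > >         public Transformer fit(Table training) {
> > >             double scale = 10.0;  // placeholder: a real stage derives this from `training`
> > >             // The returned model only rewrites the Table expression; no data
> > >             // is moved until the surrounding job is executed.
> > >             return input -> input.select(column + " / " + scale + " as scaled_" + column);
> > >         }
> > >     }
> > >
> > > Because every stage consumes and produces Table, such stages compose
> > > freely with relational preprocessing written directly in Table API/SQL,
> > > which is where the interoperability comes from.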
> > >
> > >
> > > Any feedback is highly appreciated.
> > >
> > >
> > > Thanks
> > >
> > > Weihua
> > >
> > >
> > > [1] https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing
> > >
> > >
> >
>