Posted to dev@flink.apache.org by Yun Gao <yu...@aliyun.com.INVALID> on 2018/11/20 16:35:59 UTC

Re: [DISCUSS] Embracing Table API in Flink ML

Hi Weihua,

    Thanks for the exciting proposal! 

    I have quickly read through it, and I really appreciate the idea of providing an ML Pipeline API similar to the commonly used library scikit-learn, since it greatly reduces the learning cost for AI engineers transferring to the Flink platform. 

    Currently we are also working on a related issue, namely enhancing the stream iteration of Flink to support both SGD and online learning; it also supports batch training as a special case. We have a rough design and will start a new discussion in the next few days. I think the enhanced stream iteration will help to implement Estimators directly in Flink, and it may also simplify the online learning pipeline by eliminating the requirement to load models from external file systems.
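
    For reference, below is a minimal sketch of the existing DataStream iteration primitive that we plan to enhance (Flink 1.6 style API). It only counts numbers down inside the loop; an SGD or online-learning Estimator would instead feed model updates back into the iteration, which is what the enhancement is about.

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.IterativeStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StreamIterationSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<Long> numbers = env.generateSequence(0, 1000);

            // Open an iteration: everything passed to closeWith() re-enters the loop head.
            IterativeStream<Long> iteration = numbers.iterate();

            DataStream<Long> minusOne = iteration.map(new MapFunction<Long, Long>() {
                @Override
                public Long map(Long value) {
                    return value - 1;
                }
            });

            // Feed back the elements that still need more iterations.
            iteration.closeWith(minusOne.filter(value -> value > 0));

            // Elements that reached zero leave the loop as the result stream.
            minusOne.filter(value -> value <= 0).print();

            env.execute("Stream iteration sketch");
        }
    }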

    I will read the design doc more carefully. Thanks again for sharing the design doc!

Yours sincerely
    Yun Gao 


------------------------------------------------------------------
From: Weihua Jiang <we...@gmail.com>
Sent: Tuesday, November 20, 2018 20:53
To: dev <de...@flink.apache.org>
Subject: [DISCUSS] Embracing Table API in Flink ML

ML Pipeline is an idea introduced by Scikit-learn
<https://arxiv.org/abs/1309.0238>. Both Spark and Flink have borrowed this
idea and made their own implementations [Spark ML Pipeline
<https://spark.apache.org/docs/latest/ml-pipeline.html>, Flink ML Pipeline
<https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/libs/ml/pipelines.html>].



NOTE: though I am using the term "ML", ML Pipeline shall apply to both ML
and DL pipelines.


ML Pipeline is quite helpful for model composition (i.e. using models for
feature engineering). It also enables logic reuse between the training and
inference phases (via pipeline persistence and loading), which is essential
for AI engineering. ML Pipeline can also be a good base for a Flink-based AI
engineering platform if we give ML Pipeline good tooling support
(e.g. human-readable metadata).
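
To make the fit/transform contract behind this reuse more concrete, here is a
minimal sketch of the two core abstractions. The interface and method names are
my own assumptions modeled on scikit-learn and Spark ML, not the API defined in
the design doc:

    import org.apache.flink.table.api.Table;

    // Hypothetical abstractions for a Table-based pipeline; the names are illustrative only.
    interface Transformer {
        // Apply a (possibly trained) transformation to a Table and return the result Table.
        // The same Transformer is used in training and inference jobs, enabling logic reuse.
        Table transform(Table input);
    }

    interface Estimator<M extends Transformer> {
        // Learn from a Table of training samples and return a fitted Transformer (the model).
        M fit(Table training);
    }

In such a sketch, a pipeline would itself be an Estimator that chains stages,
and the fitted pipeline a Transformer that can be persisted (ideally with
human-readable metadata) and loaded back in the inference job.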


As the Table API will be the unified high-level API for both stream and
batch processing, I want to initiate the design discussion of a new
Table-based Flink ML Pipeline.
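
As a concrete reminder of that unification, both a batch program and a streaming
program expose their data through the same Table abstraction in the current
(Flink 1.6 style) API, which is what would let one pipeline definition cover
batch training and streaming inference:

    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.api.java.BatchTableEnvironment;
    import org.apache.flink.table.api.java.StreamTableEnvironment;

    public class TableEnvironmentsSketch {
        public static void main(String[] args) {
            // Batch side: Tables backed by DataSets.
            ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
            BatchTableEnvironment batchTableEnv = TableEnvironment.getTableEnvironment(batchEnv);

            // Streaming side: Tables backed by DataStreams.
            StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment streamTableEnv = TableEnvironment.getTableEnvironment(streamEnv);

            // A pipeline written against Table can be executed with either environment.
        }
    }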


I drafted a design document [1] for this discussion. The design tries to
create a new ML Pipeline implementation so that concrete ML/DL algorithms
can fit into this new API to achieve interoperability.
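
For example, under the hypothetical Transformer interface sketched above, even a
trivial feature-engineering stage could be expressed purely against the Table
API. The column names and the derived feature below are made up for illustration:

    import org.apache.flink.table.api.Table;

    // A toy stage showing how a concrete algorithm could plug into a Table-based
    // pipeline: it only derives one extra feature column. Real ML/DL algorithms would
    // do much more, but they would exchange data through Table in the same way.
    class DoubleFeature implements Transformer {
        @Override
        public Table transform(Table input) {
            // Assumes the input Table has a numeric column named "x".
            return input.select("x, x * 2 as x2");
        }
    }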


Any feedback is highly appreciated.


Thanks

Weihua


[1]
https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing


Re: [DISCUSS] Embracing Table API in Flink ML

Posted by Weihua Jiang <we...@gmail.com>.
It has been a while, and I think we can move forward to the JIRA discussion.
I will try to split the design into smaller pieces to make it more
understandable.

Actually, I have already implemented an initial version and ported some
flink.ml algorithms to this new API, so we now have a better base for the
design discussion.

Thanks
Weihua


Re: [DISCUSS] Embracing Table API in Flink ML

Posted by Chen Qin <qi...@gmail.com>.
Hi Yun,

Very excited to see Flink ML move forward! Your document touches on many
points. I couldn't agree more about the value a (unified) Table API could
bring to the Flink ecosystem for running ML workloads. Most ML pipelines we
observed start from single-box Python scripts or ad-hoc tools that
researchers run to train models on a powerful machine. When that proves
successful, they need to hook up with a data warehouse and extract features
(SQL kicks in). In the training phase, the landscape is very segmented:
small to medium-sized models can be trained on the JVM, while large/deep
models need to optimize the per-iteration random data shuffle (SGD-based DL)
and often end up in JNI/C++/CUDA with dedicated task scheduling (gang
scheduling instead of hacking around map-reduce).

Hope it makes sense. BTW, XGBoost (the most popular ML competition
framework) has very primitive Flink support, which might be worth checking out:
https://github.com/dmlc/xgboost

Chen


Re: [DISCUSS] Embracing Table API in Flink ML

Posted by Weihua Jiang <we...@gmail.com>.
Hi Yun,

Can't wait to see your design.

Thanks
Weihua
