
Posted to dev@flink.apache.org by Weihua Jiang <we...@gmail.com> on 2018/11/20 12:53:28 UTC

[DISCUSS] Embracing Table API in Flink ML

ML Pipeline is an idea introduced by scikit-learn
<https://arxiv.org/abs/1309.0238>. Both Spark and Flink have borrowed this
idea and made their own implementations [Spark ML Pipeline
<https://spark.apache.org/docs/latest/ml-pipeline.html>, Flink ML Pipeline
<https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/libs/ml/pipelines.html>].



NOTE: although I use the term "ML", the ML Pipeline is meant to apply to
both ML and DL pipelines.


ML Pipeline is quite helpful for model composition (e.g. using models for
feature engineering). It also enables logic reuse between the training and
inference phases (via pipeline persistence and loading), which is essential
for AI engineering. ML Pipeline can also be a good base for a Flink-based AI
engineering platform if we give it good tooling support (e.g.
human-readable metadata).
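
To make the idea concrete, here is a minimal sketch in plain Python. It is
not the API from the design document and not Flink code; the class names
(Scaler, Threshold, Pipeline), the STAGES registry and the JSON layout are
all hypothetical, chosen only to show composition, train/inference reuse and
human-readable persistence in one place.

```python
import json


class Scaler:
    """Hypothetical estimator: learns a mean during fit, then centers values."""

    def __init__(self, mean=None):
        self.mean = mean

    def fit(self, rows):
        self.mean = sum(rows) / len(rows)
        return self

    def transform(self, rows):
        return [x - self.mean for x in rows]

    def to_dict(self):
        return {"type": "Scaler", "mean": self.mean}


class Threshold:
    """Hypothetical stateless transformer: maps values to 0/1 labels."""

    def __init__(self, cutoff=0.0):
        self.cutoff = cutoff

    def fit(self, rows):
        return self  # nothing to learn

    def transform(self, rows):
        return [1 if x > self.cutoff else 0 for x in rows]

    def to_dict(self):
        return {"type": "Threshold", "cutoff": self.cutoff}


# Registry used to rebuild stages from their saved metadata (an assumption).
STAGES = {
    "Scaler": lambda d: Scaler(d["mean"]),
    "Threshold": lambda d: Threshold(d["cutoff"]),
}


class Pipeline:
    """Chains stages; the same chain is used for training and inference."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        # Each stage is fitted on the output of the previous stages.
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

    def save(self):
        # Human-readable metadata: one JSON object per stage.
        return json.dumps([s.to_dict() for s in self.stages], indent=2)

    @staticmethod
    def load(text):
        return Pipeline([STAGES[d["type"]](d) for d in json.loads(text)])


trained = Pipeline([Scaler(), Threshold()]).fit([1.0, 2.0, 3.0])
restored = Pipeline.load(trained.save())  # e.g. inside an inference job
print(restored.transform([0.0, 5.0]))
```

The point of the sketch is that the inference side only needs the saved
text to rebuild exactly the logic that was trained, and that a tool (or a
person) can read that text directly.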


As the Table API will be the unified high-level API for both stream and
batch processing, I want to initiate the design discussion of a new
Table-based Flink ML Pipeline.


I drafted a design document [1] for this discussion. The design tries to
create a new ML Pipeline implementation so that concrete ML/DL algorithms
can fit into this new API and interoperate with each other.


Any feedback is highly appreciated.


Thanks

Weihua


[1]
https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing

Re: [DISCUSS] Embracing Table API in Flink ML

Posted by Weihua Jiang <we...@gmail.com>.

Hi Jincheng,

Thanks a lot for the warm feedback.

I've already read your Table API enhancement Google doc. Those enhancements
are essential for implementing any ML/DL algorithm on the Table API. Our two
designs are perfectly complementary to each other. :)

I will add a section to my Google doc for the phased implementation plan.

Thanks
Weihua

(In reply to the message from jincheng sun <su...@gmail.com> of Tue, Nov 20, 2018, 9:17 PM.)

Re: [DISCUSS] Embracing Table API in Flink ML

Posted by Weihua Jiang <we...@gmail.com>.

Hi Becket,

Thanks a lot for the Table API enhancement design doc.

I am working on some simple ML algorithms using this new ML pipeline. I will
let you know if any Table API enhancement is needed.

Thanks
Weihua


(In reply to the message from Becket Qin <be...@gmail.com> of Tue, Nov 20, 2018, 10:43 PM.)

Re: [DISCUSS] Embracing Table API in Flink ML

Posted by Becket Qin <be...@gmail.com>.

Hi Weihua,

Thanks for the well written design doc!

The ML pipeline abstraction is pretty handy for AI engineers. As
Jincheng mentioned, there is an ongoing effort to enhance the Table API
for ML. But it would still be helpful to understand what is missing in the
Table API to fully support the ML pipeline. Given that there are quite a
few proposed APIs and various related items to discuss, do you think
having some examples of how the pipeline works would facilitate the
discussion?

Again, thanks for kicking off the discussion.

Jiangjie (Becket) Qin


(In reply to the message from jincheng sun <su...@gmail.com> of Tue, Nov 20, 2018, 9:17 PM.)

Re: [DISCUSS] Embracing Table API in Flink ML

Posted by jincheng sun <su...@gmail.com>.

Hi Weihua,
Thanks for bringing up this discussion!

I quickly read the Google doc, and I fully agree that ML can be well
supported on the Table API (at some stage in the future).
In fact, Xiaowei and I have already started a discussion on enhancing
the Table API. In the first phase, we will add support for
map/flatmap/agg/flatagg in the Table API.
So I am very happy to be involved in this discussion and will leave
comments in the Google doc later.
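
For readers who have not followed the other thread, here is a rough
illustration of what the four operations are generally understood to mean,
written on plain Python lists of dict rows. This is only a sketch of the
semantics as I assume them; the helper names (map_rows, flatmap_rows,
agg_rows, flatagg_rows) are made up for this example and are not the Table
API.

```python
def map_rows(rows, fn):
    """map: one row in, exactly one row out."""
    return [fn(r) for r in rows]


def flatmap_rows(rows, fn):
    """flatmap: one row in, zero or more rows out."""
    return [out for r in rows for out in fn(r)]


def agg_rows(rows, fn):
    """agg: many rows in, a single row out."""
    return fn(rows)


def flatagg_rows(rows, fn):
    """flatagg: many rows in, zero or more rows out (e.g. a top-N)."""
    return list(fn(rows))


rows = [
    {"text": "a b", "score": 3},
    {"text": "c", "score": 5},
    {"text": "d e", "score": 1},
]

# map: rescale a feature, row by row.
doubled = map_rows(rows, lambda r: {**r, "score": r["score"] * 2})

# flatmap: tokenize, producing one output row per token.
tokens = flatmap_rows(rows, lambda r: [{"token": t} for t in r["text"].split()])

# agg: collapse all rows into one summary row.
total = agg_rows(rows, lambda rs: {"total": sum(r["score"] for r in rs)})

# flatagg: keep the two highest-scoring rows.
top2 = flatagg_rows(
    rows, lambda rs: sorted(rs, key=lambda r: r["score"], reverse=True)[:2]
)
```

Feature transformers mostly look like map/flatmap, while statistics and
model fitting look like agg/flatagg, which is why these operations matter
for ML on tables.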

I think it would be great if you could add a phased implementation plan to
the Google doc. What do you think?

Thanks,
Jincheng


(In reply to the message from Weihua Jiang <we...@gmail.com> of Tue, Nov 20, 2018, 8:53 PM.)

Re: [DISCUSS] Embracing Table API in Flink ML

Posted by Weihua Jiang <we...@gmail.com>.

It has been a while, and I think we can move forward to a JIRA discussion.
I will try to split the design into smaller pieces to make it easier to
understand.

Actually, I have already implemented an initial version and ported some
flink.ml algorithms to this new API, so we have a better basis for the
design discussion.

Thanks
Weihua

(In reply to the message from Chen Qin <qi...@gmail.com> of Wed, Nov 21, 2018, 1:36 PM.)

Re: [DISCUSS] Embracing Table API in Flink ML

Posted by Chen Qin <qi...@gmail.com>.

Hi Yun,

Very excited to see Flink ML moving forward! Your document touches on many
points. I couldn't agree more about the value that a (unified) Table API
could bring to the Flink ecosystem for running ML workloads. Most ML
pipelines we have observed start from single-box Python scripts or ad hoc
tools that researchers run to train a model on a powerful machine. When that
proves successful, they need to hook up with a data warehouse and extract
features (this is where SQL kicks in). In the training phase, the landscape
is very fragmented. Small to medium-sized models can be trained on the JVM,
while large/deep models need to optimize operators and the per-iteration
random data shuffle (SGD-based DL), which often ends up in JNI/C++/CUDA and
custom task scheduling (gang scheduling instead of hacking around
map-reduce).

Hope this makes sense. BTW, xgboost (the most popular ML competition
framework) has very primitive Flink support; it might be worth checking out.
https://github.com/dmlc/xgboost

Chen

(In reply to the message from Weihua Jiang <we...@gmail.com> of Tue, Nov 20, 2018, 6:13 PM.)

Re: [DISCUSS] Embracing Table API in Flink ML

Posted by Weihua Jiang <we...@gmail.com>.

Hi Yun,

Can't wait to see your design.

Thanks
Weihua

(In reply to the message from Yun Gao <yu...@aliyun.com.invalid> of Wed, Nov 21, 2018, 12:43 AM.)

Re: [DISCUSS] Embracing Table API in Flink ML

Posted by Yun Gao <yu...@aliyun.com.INVALID>.

Hi Weihua,

Thanks for the exciting proposal!

I have quickly read through it, and I really appreciate the idea of providing an ML Pipeline API similar to the commonly used library scikit-learn, since it greatly reduces the learning cost for AI engineers moving to the Flink platform.

Currently we are also working on a related issue, namely enhancing the stream iteration of Flink to support both SGD and online learning; it also supports batch training as a special case. We have a rough design and will start a new discussion in the next few days. I think the enhanced stream iteration will help to implement Estimators directly in Flink, and it may help to simplify the online learning pipeline by eliminating the requirement to load the models from external file systems.
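
As a rough illustration of the kind of training loop meant here, the sketch below runs SGD for a linear model over a stream of records in plain Python. It is not Flink's iteration API and not the design mentioned above; the function names (sgd_step, train_online, train_batch) and the squared-error model are assumptions made only for this example.

```python
def sgd_step(weights, features, label, lr=0.1):
    """One SGD update for squared error on a single (features, label) record."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - label
    return [w - lr * error * x for w, x in zip(weights, features)]


def train_online(stream, n_features, lr=0.1):
    """Online learning: update on every record and expose the model at once.

    The current weights are yielded after each update, so a consumer can use
    them for inference without loading a model from an external file system.
    """
    weights = [0.0] * n_features
    for features, label in stream:
        weights = sgd_step(weights, features, label, lr)
        yield weights


def train_batch(dataset, n_features, epochs, lr=0.1):
    """Batch training as a special case: replay a finite dataset as a stream."""
    stream = (record for _ in range(epochs) for record in dataset)
    weights = [0.0] * n_features
    for weights in train_online(stream, n_features, lr):
        pass  # keep only the final model
    return weights


# Records follow y = 2 * x, so the single weight should approach 2.0.
data = [([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0)]
final_weights = train_batch(data, n_features=1, epochs=50, lr=0.05)
```

The same update function serves both modes; only the source of records differs, which is the sense in which batch training is a special case of the streaming loop.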

I will read the design doc more carefully. Thanks again for sharing the design doc!

Yours sincerely
Yun Gao

(In reply to the message from Weihua Jiang <we...@gmail.com> to dev <de...@flink.apache.org> of Tue, Nov 20, 2018, 20:53.)

Re: [DISCUSS] Embracing Table API in Flink ML

Posted by Weihua Jiang <we...@gmail.com>.

Hi Shaoxuan,

You are perfectly right. What I want to achieve is a combination of all
three of your points. Let me rephrase them here:
1. Define a Table-based ML Pipeline interface with the same functionality
as the current DataSet-based implementation.
2. Support new features such as online learning and streaming inference.
3. Provide a base for Flink AI tooling (e.g. an AI platform) and ML/DL SQL
support.

This will definitely be done step by step and will need a lot of help
from the Table API enhancements. I am currently working on #1.

Thanks
Weihua

(In reply to the message from Shaoxuan Wang <ws...@gmail.com> of Tue, Nov 20, 2018, 11:11 PM.)

Re: [DISCUSS] Embracing Table API in Flink ML

Posted by Shaoxuan Wang <ws...@gmail.com>.

Hi Weihua,

Thanks for the proposal. I have quickly read through it. It looks great.
A quick question: are you also considering moving the ML lib (the
implementations of Estimator/Predictor/Transformer) on top of the Table
API? I would be very happy if this were included in the scope. It is not
easy and needs lots of new Table API functionality, which is exactly
one of the reasons that motivated us to "enhance the Table API", as
discussed in other threads.

The entire scope of your proposal is so big that I would suggest we
complete it step by step. I think you have mainly proposed 3
things:
1. Redesign the ML pipeline based on the Table API
2. Take the streaming ML pipeline into account
3. Enhance the ML pipeline with some new features for a better user experience
Maybe we should first replace the ML pipeline interface with the Table API,
then move on to #2 and #3. In the meantime, we can also explore the
possibility of moving the ML lib on top of the Table API. What do
you think?

BTW, we should not break the current ML pipeline interface (which is
based on DataSet) when we introduce the new one. Let us keep it for
a while, until the new interface is complete and well adopted. Then
we can deprecate the old one.

I will take a more thorough look at your proposal and leave comments
directly on the doc.

Regards,
Shaoxuan

(In reply to the message from Weihua Jiang <we...@gmail.com> of 11/20/18.)

-- 
-----------------------------------------------------------------------------------

*Rome was not built in one day*

-----------------------------------------------------------------------------------