Posted to dev@flink.apache.org by Becket Qin <be...@gmail.com> on 2021/08/06 07:45:01 UTC

Re: [DISCUSS] FLIP-173: Support DAG of algorithms (Flink ML)

Hi Dong,

Sorry for the late reply.

> I am a bit confused by this description of the semantic change. By "from
> Data -> Data conversion to generic Table -> Table", do you mean "Table !=
> Data"?


Yes, I think that Table and Data are not equivalent in this case. It might
depend on what people think of "data" in the ML scenario. To me, a Table is
nothing but a container format for a bunch of records. It neither
understands nor cares about the underlying meaning of those records. For
example, the records could be training samples, metrics, model parameters,
some control commands, etc. However, the input data of a "Transformer" in
ML pipelines is not really that generic. At the very least, it would be
weird to have a Transformer take training samples and emit a model, because
that is exactly what an Estimator is for. So in our scenario, making
transformers a generic record-processing API seems a little
counterintuitive.

I don't have a definitive answer as to what should be a Transformer and what
should not. A rough idea is that the output of a Transformer can usually
be fed into an Estimator as training samples.
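To make the distinction above concrete, here is a minimal sketch (in Python for brevity; the class and method names are illustrative, not Flink ML's actual API) of the contract being discussed: an Estimator's fit() consumes training data and produces a Transformer, whose transform() performs a plain data-to-data conversion whose output could itself serve as training samples.

```python
# Illustrative sketch only; not the actual Flink ML API.
class Transformer:
    """Data -> Data conversion."""
    def transform(self, table):
        raise NotImplementedError

class Estimator:
    """Data -> Model conversion: fit() returns a Transformer."""
    def fit(self, table):
        raise NotImplementedError

# A toy scaler. The Transformer produced by fit() emits records that
# could be fed into another Estimator as training samples, which is the
# rough criterion suggested above.
class ScalerModel(Transformer):
    def __init__(self, factor):
        self.factor = factor
    def transform(self, table):
        return [x * self.factor for x in table]

class Scaler(Estimator):
    def fit(self, table):
        peak = max(abs(x) for x in table)
        return ScalerModel(1.0 / peak if peak else 1.0)
```

For example, `Scaler().fit([2.0, -4.0]).transform([1.0])` yields `[0.25]`: fitting learns a scale factor, transforming just applies it.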

Yes, the intended change looks good to me.

Cheers,

Jiangjie (Becket) Qin

On Tue, Jul 20, 2021 at 9:55 PM Dong Lin <li...@gmail.com> wrote:

> Hi Becket,
>
> Thank you for the detailed reply!
>
> My understanding of your comments is that most of option-1 looks good
> except for its change to the Transformer semantics. Please see my reply
> inline.
>
>
> On Tue, Jul 20, 2021 at 11:43 AM Becket Qin <be...@gmail.com> wrote:
>
> > Hi Dong, Zhipeng and Fan,
> >
> > Thanks for the detailed proposals. It is quite a lot of reading! Given that
> > we are introducing a lot of stuff here, I find that it might be easier for
> > people to discuss if we can list the fundamental differences first. From
> > what I understand, the most fundamental difference between the two
> > proposals is the following:
> >
> > * In order to support a graph structure, do we extend Transformer/Estimator,
> > or do we introduce a new set of APIs? *
> >
> > Proposal 1 tries to keep one set of APIs, which is based on
> > Transformer/Estimator/Pipeline. More specifically, it does the following:
> >     - Make Transformer and Estimator multi-input and multi-output (MIMO).
> >     - Introduce Graph/GraphModel as counterparts of
> > Pipeline/PipelineModel.
> >
> > Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is.
> > Instead, it introduces AlgoOperators to support the graph structure. The
> > AlgoOperators are general-purpose graph nodes supporting MIMO. They are
> > independent of Pipeline / Transformer / Estimator.
> >
> >
> > My two cents:
> >
> > I think it is a big advantage to have a single set of APIs rather than
> > two independent sets, if possible. But I would suggest we change the
> > current proposal 1 a little bit, by learning from proposal 2.
> >
> > What I like about proposal 1:
> > 1. A single set of APIs, symmetric in Graph/GraphModel and
> > Pipeline/PipelineModel.
> > 2. Keeping most of the benefits from Transformer/Estimator, including the
> > fit-then-transform relation and save/load capability.
> >
> > However, proposal 1 also introduces some changes that I am not sure about:
> >
> > 1. The most controversial part of proposal 1 is whether we should extend
> > Transformer/Estimator/Pipeline at all. In fact, different projects have
> > slightly different designs for Transformer/Estimator/Pipeline, so I think
> > it is OK to extend it. However, there are some commonly recognized core
> > semantics that we ideally want to keep. To me these core semantics are:
> >   1. Transformer is a Data -> Data conversion; Estimator deals with the
> > Data -> Model conversion.
> >   2. Estimator.fit() gives a Transformer, and users can just call
> > Transformer.transform() to perform inference.
> > To me, as long as these core semantics are kept, an extension to the API
> > seems acceptable.
> >
> > Proposal 1 extends the semantics of Transformer from a Data -> Data
> > conversion to a generic Table -> Table conversion, and claims it is
> > equivalent to
> >
>
> I am a bit confused by this description of the semantic change. By "from
> Data -> Data conversion to generic Table -> Table", do you mean "Table !=
> Data"?
>
> I am hoping we can clearly define the semantics of Transformer and agree on
> that definition within the community. Then we can see how option-1 changes
> those semantics and update option-1 as appropriate.
>
> > "AlgoOperator" in proposal 2 as a general-purpose graph node. It does
> > change the first semantic. That said, this might just be a naming
> > problem. One possible solution to this problem is having a new subclass of
> > Stage without strong conventional semantics, e.g. "AlgoOp" if we borrow
> > the name from proposal 2, and let Transformer extend it. Just like a
> >
>
> Assuming that option-1 does change the meaning of Transformer (this point
> depends on the answer to the previous question), I am happy to update
> option-1 to address the concern.
>
> Here is how I would update option-1:
> - Add an interface named "MLOperator". This interface will be the same as
> the existing Transformer interface in option-1 except for its name.
> - Make Transformer a subclass of MLOperator. Because Transformer has
> exactly the same API as MLOperator, no extra method is defined in
> Transformer.
> - Every other API/class in option-1 which deals with Transformer will now
> deal with MLOperator.
>
> The change proposed above basically renames Transformer to address the
> semantic/naming concern. Does this sound good?
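(To illustrate the rename described above, here is a minimal hedged sketch, in Python pseudocode rather than the real API surface: "MLOperator" carries the generic Table -> Table method, and Transformer extends it without adding any methods, keeping only the narrower ML semantics.)

```python
# Hypothetical sketch of the proposed rename; names are from this thread,
# but the actual Flink ML API may differ.
class MLOperator:
    """Generic (multi-)Table -> Table conversion."""
    def transform(self, *tables):
        raise NotImplementedError

class Transformer(MLOperator):
    """Exactly the same API as MLOperator; no extra methods are defined.
    The subclass only carries the stricter "output can serve as training
    samples" semantics discussed in this thread."""
    pass

class Estimator:
    def fit(self, *tables):
        """Returns a Transformer (and hence also an MLOperator)."""
        raise NotImplementedError
```

Every other API/class that previously dealt with Transformer would then accept MLOperator instead, so existing Transformers still fit everywhere an MLOperator is expected.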
>
>
> > PipelineModel is a more specific Transformer, a Transformer would be a
> > more specific "AlgoOp". If we do that, the processing logic that people
> > don't feel comfortable treating as a Transformer can just be put into an
> > "AlgoOp" and thus can still be added to a Pipeline / Graph. This borrows
> > the advantage of proposal 2. In other words, this essentially makes
> > "AlgoOp" the equivalent of "AlgoOperator" in proposal 2, but allows it to
> > be added to the Graph and Pipeline if people want to.
> >
> > This also gives my thoughts regarding the concern that making
> > Transformer/Estimator MIMO would break the convention of single-input
> > single-output (SISO) Transformer/Estimator. Since this does not change
> > the core semantics of Transformer/Estimator, it sounds like an intuitive
> > extension to me.
> >
> > 2. Another semantics-related case brought up was heterogeneous topologies
> > in training and inference. In that case, the input of an Estimator would
> > be different from the input of the Transformer returned by
> > Estimator.fit(). The example of this case is Word2Vec, where the input of
> > the Estimator would be an article while the input to the Transformer is a
> > single word. The well-recognized ML Pipeline doesn't seem to support this
> > case, because it assumes the input of the Estimator and the corresponding
> > Transformer are the same.
> >
> > Both proposal 1 and proposal 2 leave this case unsupported in the
> > Pipeline. To support this case:
> >    - Proposal 1 adds support for such cases in the Graph/GraphModel by
> > introducing "EstimatorInput" and "TransformerInput". The downside is that
> > it complicates the API.
> >    - Proposal 2 leaves this to users, who construct two different DAGs for
> > training and inference respectively. This means users would have to
> > construct the DAG twice even if most parts of the DAG are the same in
> > training and inference.
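(A toy sketch of the Word2Vec-style mismatch may help readers following along. All names below are invented for illustration; the point is only that the Estimator is fit on whole articles while the fitted Transformer maps individual words, so the two sides of fit-then-transform see different inputs.)

```python
# Toy illustration of the heterogeneous training/inference case.
# Training input: articles (sequences of words).
# Inference input: individual words, not articles.
class Word2VecModel:
    def __init__(self, vocab):
        # word -> integer id, standing in for a learned vector
        self.vocab = vocab
    def transform(self, words):
        return [self.vocab.get(w, -1) for w in words]

class Word2Vec:
    def fit(self, articles):
        vocab = {}
        for article in articles:
            for word in article:
                vocab.setdefault(word, len(vocab))
        return Word2VecModel(vocab)
```

A plain Pipeline cannot express this, because it would feed the same article-shaped input it gave the Estimator into the fitted Transformer, which expects words.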
> >
> > My gut feeling is that this is not a critical difference because such a
> > heterogeneous topology is sort of a corner case. Most users do not need
> > to worry about this. For those who do need this, either proposal 1 or
> > proposal 2 seems acceptable to me. That said, it looks like with proposal
> > 1, users can still choose to write the program twice without using the
> > Graph API, just like what they do in proposal 2. So technically speaking,
> > proposal 1 is more flexible and allows users to choose either flavor. On
> > the other hand, one could argue that proposal 1 may confuse users with
> > these two flavors. Although personally I feel it is clear to me, I am
> > open to other ideas.
> >
> > 3. Lastly, there was a concern that with proposal 1, some Estimators can
> > no longer be added to the Pipeline, while the original Pipeline accepts
> > any Estimator.
> >
> > It seems that users always have to make sure the input schema required by
> > the Estimator matches the input table. So even for the existing Pipeline,
> > people cannot naively add any Estimator into a pipeline. Admittedly,
> > proposal 1 adds some more requirements, namely 1) the number of inputs
> > needs to match the number of outputs of the previous stage, and 2) the
> > Estimator does not generate a Transformer with a different required input
> > schema (the heterogeneous case mentioned above). However, given that these
> > mismatches will result in exceptions at compile time, just like when users
> > put in an Estimator with a mismatched input schema, personally I find it
> > does not change the user experience much.
> >
> >
> > So to summarize my thoughts on this fundamental difference.
> >     - In general, I personally prefer having one set of APIs.
> >     - The current proposal 1 may need some improvements in some cases, by
> > borrowing something from proposal 2.
> >
> >
> >
> > A few other differences that I consider as non-fundamental:
> >
> > * Do we need a top-level encapsulation API for an algorithm? *
> >
> > Proposal 1 has a concept of a Graph, which encapsulates the entire
> > algorithm to provide a unified API following the same semantics as
> > Estimator/Transformer. Users can choose not to package everything into a
> > Graph, but just write their own program and wrap it as an ordinary
> > function.
> >
> > Proposal 2 does not have a top-level API such as Graph. Instead, users
> > can choose to write an arbitrary function if they want to.
> >
> > From what I understand, in proposal 1, users may still choose to ignore
> > the Graph API and simply construct a DAG by themselves by calling
> > transform() and fit(), or by calling AlgoOp.process() if we add "AlgoOp"
> > to proposal 1 as I suggested earlier. So Graph is just an additional way
> > to construct a graph - people can use Graph in a similar way as they do
> > the Pipeline/PipelineModel. In other words, there is no conflict between
> > proposal 1 and proposal 2.
> >
> >
> > * The ways to describe a Graph? *
> >
> > Proposal 1 gives two ways to construct a DAG:
> > 1. the raw API using Estimator/Transformer (and potentially "AlgoOp" as
> > well).
> > 2. using the GraphBuilder API.
> >
> > Proposal 2 only gives the raw API of AlgoOperator. It assumes there is a
> > main output and some other side outputs, so it can call
> > algoOp1.linkFrom(algoOp2) without specifying the index of the output, at
> > the cost of wrapping all the Tables into an AlgoOperator.
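(For readers who haven't followed proposal 2 closely, a rough sketch of what linkFrom-style wiring could look like. This is a simplification with invented internals; the actual AlgoOperator API may differ, and side outputs are omitted here.)

```python
# Simplified sketch of linkFrom-style DAG wiring, as described for
# proposal 2: each operator exposes one main output, so downstream
# operators can link without naming an output index.
class AlgoOperator:
    def __init__(self, fn):
        self.fn = fn       # the Table(s) -> Table logic of this node
        self.inputs = []   # upstream operators providing the inputs

    def linkFrom(self, *ops):
        """Wire this operator's inputs to upstream operators' main
        outputs; returns self so calls can be chained."""
        self.inputs = list(ops)
        return self

    def execute(self, source=None):
        """Evaluate the DAG rooted at this node (main output only)."""
        if not self.inputs:
            return self.fn(source)
        return self.fn(*[op.execute(source) for op in self.inputs])

# op2 consumes op1's main output without specifying an output index.
op1 = AlgoOperator(lambda t: [x + 1 for x in t])
op2 = AlgoOperator(lambda t: [x * 2 for x in t]).linkFrom(op1)
```

Here `op2.execute([1, 2])` first runs op1 to get `[2, 3]`, then doubles it to `[4, 6]`, which is the one-main-output convention the paragraph above describes.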
> >
> > The usability argument was mostly around the raw APIs. I don't think the
> > two APIs differ too much from each other. With the same assumption,
> > proposal 1 and proposal 2 can probably achieve very similar levels of
> > usability when describing a Graph, if not exactly the same.
> >
> >
> > There are some other differences/arguments raised between the two
> > proposals. However, I don't think they are fundamental. And just like the
> > cases mentioned above, the two proposals can easily learn from each other.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <li...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two
> > > different designs to extend the Flink ML API to support more use-cases,
> > > e.g. expressing a DAG of preprocessing and training logic. These two
> > > designs have been documented in FLIP-173
> > > <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783>.
> > >
> > > We have different opinions on the usability and the ease-of-understanding
> > > of the proposed APIs. It will be really useful to have comments on those
> > > designs from the open source community and to learn your preferences.
> > >
> > > To facilitate the discussion, we have summarized our design principles
> > > and opinions in this Google doc
> > > <https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do>.
> > > Code snippets for a few example use-cases are also provided in this doc
> > > to demonstrate the difference between these two solutions.
> > >
> > > This Flink ML API is super important to the future of the Flink ML
> > > library. Please feel free to reply to this email thread or comment in
> > > the Google doc directly.
> > >
> > > Thank you!
> > > Dong, Zhipeng, Fan
> > >
> >
>