Posted to dev@flink.apache.org by Stavros Kontopoulos <st...@gmail.com> on 2017/02/20 10:43:20 UTC

[DISCUSS] Flink ML roadmap

(Resending with the appropriate topic)

Hi,

I would like to start a discussion about the next steps for Flink ML.
Currently there is a lot of work going on, but it needs a push forward.

Some topics to discuss:

a) How features should be planned and aligned with Flink releases.
b) Priorities of what should be done.
c) Basic guidelines for code: style guides, scikit-learn compliance, etc.
d) Missing features important for the success of the library, next steps,
etc.

Thoughts?

Best,
Stavros

Re: [DISCUSS] Flink ML roadmap

Posted by Gábor Hermann <ma...@gaborhermann.com>.
@Theodore, thanks for taking the lead in the coordination :)

Let's see what we can do, and then decide what should start out as an
independent project and what should stay strictly inside Flink.
I agree that something experimental like batch ML on streaming would
probably benefit more from starting in an independent repo first.

On 2017-02-23 16:56, Theodore Vasiloudis wrote:

> Sure having a deadline for March 3rd is fine. I can act as coordinator,
> trying to guide the discussion to concrete results.
>
> For committers, it's up to their discretion and time whether they want to
> participate. I don't think it's necessary to have one, but it would be most
> welcome.
>
> @Katherin I would suggest you start a topic on the list about FLINK-1730,
> if it takes a lot of development effort from your side it's best to at
> least try to gauge the community's interest, and whether there will be
> motivation to merge the changes.
>
> Maybe at the end of this we will have a FLIP we can submit; that's probably
> the way forward if we want to keep this effort within the project. For a new,
> highly experimental project like batch ML on streaming I would actually
> favor developing on an independent repo, which can later be merged into
> main if there is interest.
>
> Regards.
> Theodore
>
> On Thu, Feb 23, 2017 at 4:41 PM, Gábor Hermann <ma...@gaborhermann.com>
> wrote:
>
>> Okay, let's just aim for around the end of next week, but we can take more
>> time to discuss if there's still a lot of ongoing activity. Keep the topic
>> hot!
>>
>> Thanks all for the enthusiasm :)
>>
>>
>>
>> On 2017-02-23 16:17, Stavros Kontopoulos wrote:
>>
>>> @Gabor 3rd March is ok for me. But maybe giving it a bit more time, like
>>> a week, may suit more people.
>>> What do you all think?
>>> I will contribute to the doc.
>>>
>>> +100 for having a coordinator + committer.
>>>
>>> Thank you all for joining the discussion.
>>>
>>> Cheers,
>>> Stavros
>>>
>>> On Thu, Feb 23, 2017 at 4:48 PM, Gábor Hermann <ma...@gaborhermann.com>
>>> wrote:
>>>
>>>> Okay, I've created a skeleton of the design doc for choosing a direction:
>>>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing
>>>>
>>>> Much of the pros/cons have already been discussed here, so I'll try to put
>>>> all the arguments mentioned in this thread there. Feel free to add more :)
>>>>
>>>> @Stavros: I agree we should take action fast. What about collecting our
>>>> thoughts in the doc by around Tuesday next week (28 February)? Then decide
>>>> on the direction and design a roadmap by around Friday (3 March)? Is that
>>>> feasible, or should it take more time?
>>>>
>>>> I think it will be necessary to have a shepherd, or even better a
>>>> committer, to be involved in at least reviewing and accepting the roadmap.
>>>> It would be best if a committer coordinated all this.
>>>> @Theodore: Would you like to do the coordination?
>>>>
>>>> Regarding the use-cases: I've seen some abstracts of talks at SF Flink
>>>> Forward [1] that seem promising. There are companies already using Flink
>>>> for ML [2,3,4,5].
>>>>
>>>> [1] http://sf.flink-forward.org/program/sessions/
>>>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
>>>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
>>>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-learning-on-flink/
>>>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learning-scenarios-with-flink/
>>>>
>>>> Cheers,
>>>> Gabor
>>>>
>>>>
>>>>
>>>> On 2017-02-23 15:19, Katherin Eri wrote:
>>>>
>>>>> I have asked already some teams for useful cases, but all of them need
>>>>> time to think.
>>>>> During analysis something will finally arise.
>>>>> Maybe we can ask partners of Flink for cases? Data Artisans got results
>>>>> of a customer survey [1]: better ML support is wanted, so we could ask
>>>>> what exactly is necessary.
>>>>>
>>>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
>>>>>
>>>>> On 23 Feb 2017 at 4:32 PM, "Stavros Kontopoulos"
>>>>> <st.kontopoulos@gmail.com> wrote:
>>>>>
>>>>>> +100 for a design doc.
>>>>>>
>>>>>> Could we also set a roadmap after some time-boxed investigation captured
>>>>>> in that document? We need action.
>>>>>>
>>>>>> Looking forward to working on this (whatever that might be) ;) Also, are
>>>>>> there any data supporting one direction or the other from a customer
>>>>>> perspective? It would help to make more informed decisions.
>>>>>>
>>>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <ka...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, ok.
>>>>>>> Let's start some design document and write down there the already
>>>>>>> mentioned ideas: the parameter server, Clipper and others. It would be
>>>>>>> nice if we also mapped these approaches to cases.
>>>>>>> We will work on it collaboratively on each topic; maybe finally we will
>>>>>>> form some picture that the committers can agree with.
>>>>>>> @Gabor, could you please start such a shared doc, as you already have
>>>>>>> several ideas proposed?
>>>>>>>
>>>>>>> On Thu, 23 Feb 2017 at 15:06, Gábor Hermann <ma...@gaborhermann.com> wrote:
>>>>>>>
>>>>>>>> I agree that it's better to go in one direction first, but I think
>>>>>>>> online and offline with the streaming API can go somewhat in parallel
>>>>>>>> later. We could set a short-term goal, concentrate initially on one
>>>>>>>> direction, and showcase that direction (e.g. in a blog post). But first,
>>>>>>>> we should list the pros/cons in a design doc as a minimum. Then make a
>>>>>>>> decision on what direction to go. Would that be feasible?
>>>>>>>>
>>>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
>>>>>>>>
>>>>>>>>> I'm not sure that this is feasible; doing everything at the same time
>>>>>>>>> could mean doing nothing((((
>>>>>>>>> I'm just afraid that the words "we will work on streaming, not on
>>>>>>>>> batching, we have no committer's time for this" mean that yes, we
>>>>>>>>> started work on FLINK-1730, but nobody will commit this work in the
>>>>>>>>> end, as already happened with this ticket.
>>>>>>>>>
>>>>>>>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <mail@gaborhermann.com> wrote:
>>>>>>>>>> @Theodore: Great to hear you think the "batch on streaming" approach
>>>>>>>>>> is possible! Of course, we need to pay attention to all the pitfalls
>>>>>>>>>> there if we go that way.
>>>>>>>>>>
>>>>>>>>>> +1 for a design doc!
>>>>>>>>>>
>>>>>>>>>> I would add that it's possible to make efforts in all the three
>>>>>>>>>> directions (i.e. batch, online, batch on streaming) at the same time.
>>>>>>>>>> Although, it might be worth concentrating on one. E.g. it would not be
>>>>>>>>>> so useful to have the same batch algorithms with both the batch API
>>>>>>>>>> and the streaming API. We can decide later.
>>>>>>>>>>
>>>>>>>>>> The design doc could be partitioned into these 3 directions, and we
>>>>>>>>>> can collect there the pros/cons too. What do you think?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Gabor
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>>
>>>>>>>>>>> @Gabor, we have discussed the idea of using the streaming API to
>>>>>>>>>>> write all of our ML algorithms with a couple of people offline, and I
>>>>>>>>>>> think it might be possible and is generally worth a shot. The
>>>>>>>>>>> approach we would take would be close to Vowpal Wabbit, not exactly
>>>>>>>>>>> "online", but rather "fast-batch".
>>>>>>>>>>>
>>>>>>>>>>> There will be problems popping up again, even for very simple algos
>>>>>>>>>>> like online linear regression with SGD [1], but hopefully fixing
>>>>>>>>>>> those will be more aligned with the priorities of the community.
>>>>>>>>>>>
>>>>>>>>>>> @Katherin, my understanding is that given the limited resources,
>>>>>>>>>>> there is no development effort focused on batch processing right now.
>>>>>>>>>>>
>>>>>>>>>>> So to summarize, it seems like there are people willing to work on ML
>>>>>>>>>>> on Flink, but nobody is sure how to do it.
>>>>>>>>>>> There are many directions we could take (batch, online, batch on
>>>>>>>>>>> streaming), each with its own merits and downsides.
>>>>>>>>>>>
>>>>>>>>>>> If you want we can start a design doc, move the conversation there,
>>>>>>>>>>> come up with a roadmap and start implementing.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Theodore
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
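
A minimal, hypothetical sketch of the idea referenced in [1] above -- online
linear regression with SGD over connected streams -- using Flink's Scala
DataStream API. OnlineSgd and LabeledPoint are made-up names, not existing
FlinkML code, and forcing parallelism 1 is only a shortcut to keep a single
model instance:

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
    import org.apache.flink.util.Collector

    case class LabeledPoint(features: Array[Double], label: Double)

    // Shares one linear model between a training stream and a prediction stream.
    class OnlineSgd(learningRate: Double, dim: Int)
        extends CoFlatMapFunction[LabeledPoint, Array[Double], Double] {

      private var weights: Array[Double] = Array.fill(dim)(0.0)

      // Training input: one SGD step per labeled example (squared loss).
      override def flatMap1(p: LabeledPoint, out: Collector[Double]): Unit = {
        val err = dot(weights, p.features) - p.label
        weights = weights.zip(p.features).map { case (w, x) => w - learningRate * err * x }
      }

      // Prediction input: score a feature vector with the current weights.
      override def flatMap2(x: Array[Double], out: Collector[Double]): Unit =
        out.collect(dot(weights, x))

      private def dot(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (ai, bi) => ai * bi }.sum
    }

    object OnlineSgdSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val train = env.fromElements(
          LabeledPoint(Array(1.0, 2.0), 5.0), LabeledPoint(Array(2.0, 1.0), 4.0))
        val toScore = env.fromElements(Array(1.0, 1.0), Array(0.5, 2.0))

        train.connect(toScore)
          .flatMap(new OnlineSgd(learningRate = 0.01, dim = 2))
          .setParallelism(1) // single model instance; good enough for a sketch
          .print()

        env.execute("online-sgd-sketch")
      }
    }

A production version would have to partition or replicate the model instead
of pinning it to one task, which is exactly the kind of problem discussed in
the thread linked as [1].
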
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <mail@gaborhermann.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> It's great to see so much activity in this discussion :)
>>>>>>>>>>>> I'll try to add my thoughts.
>>>>>>>>>>>>
>>>>>>>>>>>> I think building a developer community (Till's 2. point) can be
>>>>>>>>>>>> slightly separated from what features we should aim for (1. point)
>>>>>>>>>>>> and showcasing (3. point). Thanks Till for bringing up the ideas for
>>>>>>>>>>>> restructuring; I'm sure we'll find a way to make the development
>>>>>>>>>>>> process more dynamic. I'll try to address the rest here.
>>>>>>>>>>>>
>>>>>>>>>>>> It's hard to choose directions between streaming and batch ML. As
>>>>>>>>>>>> Theo has indicated, not much online ML is used in production, but
>>>>>>>>>>>> Flink concentrates on streaming, so online ML would be a better fit
>>>>>>>>>>>> for Flink. However, as most of you argued, there's a definite need
>>>>>>>>>>>> for batch ML. But batch ML seems hard to achieve because there are
>>>>>>>>>>>> blocking issues with persisting, iteration paths etc. So it's no
>>>>>>>>>>>> good either way.
>>>>>>>>>>>>
>>>>>>>>>>>> I propose a seemingly crazy solution: what if we developed batch
>>>>>>>>>>>> algorithms also with the streaming API? The batch API would clearly
>>>>>>>>>>>> seem more suitable for ML algorithms, but there are a lot of
>>>>>>>>>>>> benefits of this approach too, so it's clearly worth considering.
>>>>>>>>>>>> Flink also has the high-level vision of "streaming for everything"
>>>>>>>>>>>> that would clearly fit this case. What do you all think about this?
>>>>>>>>>>>> Do you think this solution would be feasible? I would be happy to
>>>>>>>>>>>> make a more elaborate proposal, but I'll put my main ideas here:
>>>>>>>>>>>> 1) Simplifying by using one system
>>>>>>>>>>>> It could simplify the work of both the users and the developers. One
>>>>>>>>>>>> could execute training once, or execute it periodically e.g. by
>>>>>>>>>>>> using windows. Low-latency serving and training could be done in the
>>>>>>>>>>>> same system. We could implement incremental algorithms, without any
>>>>>>>>>>>> side inputs, for combining online learning (or predictions) with
>>>>>>>>>>>> batch learning. Of course, all the logic describing these must be
>>>>>>>>>>>> somehow implemented (e.g. synchronizing predictions with training),
>>>>>>>>>>>> but it should be easier to do so in one system than by combining
>>>>>>>>>>>> e.g. the batch and streaming API.
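
A hedged sketch of the "train periodically with windows, serve in the same
job" idea in the quoted paragraph above. Sample, Model and fit are made-up
names, not an existing Flink ML API: the model is refit over tumbling
processing-time windows and each new model is broadcast to the scoring
operator.

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.scala.function.AllWindowFunction
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow
    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
    import org.apache.flink.util.Collector

    case class Sample(features: Array[Double], label: Double)
    case class Model(weights: Array[Double]) {
      def score(x: Array[Double]): Double =
        weights.zip(x).map { case (w, xi) => w * xi }.sum
    }

    object PeriodicRetrainSketch {
      // Hypothetical batch-style fit over one window of samples (trivial averaging "fit").
      def fit(samples: Iterable[Sample]): Model = {
        val dim = samples.headOption.map(_.features.length).getOrElse(0)
        val sums = samples.foldLeft(Array.fill(dim)(0.0)) { (acc, s) =>
          acc.zip(s.features).map { case (a, x) => a + x * s.label }
        }
        Model(sums.map(_ / math.max(samples.size, 1)))
      }

      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val samples = env.fromElements(
          Sample(Array(1.0, 0.0), 1.0), Sample(Array(0.0, 1.0), -1.0))
        val events = env.fromElements(Array(1.0, 1.0), Array(0.0, 2.0))

        // "Fast-batch": refit the model over each 10-minute window of training data.
        val models: DataStream[Model] = samples
          .timeWindowAll(Time.minutes(10))
          .apply(new AllWindowFunction[Sample, Model, TimeWindow] {
            override def apply(w: TimeWindow, in: Iterable[Sample], out: Collector[Model]): Unit =
              out.collect(fit(in))
          })

        // Serve in the same job: broadcast each new model to every scoring instance.
        events.connect(models.broadcast)
          .flatMap(new CoFlatMapFunction[Array[Double], Model, Double] {
            var current: Option[Model] = None
            override def flatMap1(x: Array[Double], out: Collector[Double]): Unit =
              current.foreach(m => out.collect(m.score(x)))
            override def flatMap2(m: Model, out: Collector[Double]): Unit =
              current = Some(m)
          })
          .print()

        env.execute("periodic-retrain-sketch")
      }
    }

Broadcasting the model stream means every parallel scoring instance sees
every refit model, which here stands in for the side-input mechanism
mentioned later in point 5.
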
>>>>>>>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>>>>>>>> Despite these benefits, it could seem harder to implement batch ML
>>>>>>>>>>>> with the streaming API, but in my opinion it's not. There are more
>>>>>>>>>>>> flexible, lower-level optimization potentials with the streaming
>>>>>>>>>>>> API. Most distributed ML algorithms use a lower-level model than the
>>>>>>>>>>>> batch API anyway, so sometimes it feels like forcing the algorithm
>>>>>>>>>>>> logic into the training API and tweaking it. Although we could not
>>>>>>>>>>>> use the batch primitives like join, we would have more flexibility.
>>>>>>>>>>>> E.g. in my experience with implementing a distributed matrix
>>>>>>>>>>>> factorization algorithm [1], I couldn't do a simple optimization
>>>>>>>>>>>> because of the limitations of the iteration API [2]. Even if we
>>>>>>>>>>>> pushed all the development effort into making the batch API more
>>>>>>>>>>>> suitable for ML, there would be things we couldn't do. E.g. there
>>>>>>>>>>>> are approaches for updating a model iteratively without locks [3,4]
>>>>>>>>>>>> (i.e. somewhat asynchronously), and I don't see a clear way to
>>>>>>>>>>>> implement such algorithms with the batch API.
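
The lock-free updates of [3,4] boil down to several workers writing to a
shared weight vector without synchronization. A toy, single-machine
illustration in plain Scala threads (deliberately racy, not a Flink API):

    object LockFreeUpdatesSketch {
      def main(args: Array[String]): Unit = {
        // Shared weight vector, deliberately unsynchronized ("Hogwild"-style).
        val weights = Array.fill(2)(0.0)
        val data = Seq((Array(1.0, 0.0), 1.0), (Array(0.0, 1.0), -1.0))

        val workers = (1 to 4).map { _ =>
          new Thread(new Runnable {
            override def run(): Unit =
              for (_ <- 1 to 1000; (x, y) <- data) {
                val err = weights.zip(x).map { case (w, xi) => w * xi }.sum - y
                for (i <- weights.indices) weights(i) -= 0.01 * err * x(i) // racy on purpose
              }
          })
        }
        workers.foreach(_.start())
        workers.foreach(_.join())
        println(weights.mkString(", "))
      }
    }
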
>>>>>>>>>>>> 3) Streaming community (users and devs) benefit
>>>>>>>>>>>> The Flink streaming community in general would also benefit from
>>>>>>>>>>>> this direction. There are many features needed in the streaming API
>>>>>>>>>>>> for ML to work, but this is also true for the batch API. One really
>>>>>>>>>>>> important one is the loops API (a.k.a. iterative DataStreams) [5].
>>>>>>>>>>>> There has been a lot of effort (mostly from Paris) for making it
>>>>>>>>>>>> mature enough [6]. Kate mentioned using GPUs, and I'm sure they have
>>>>>>>>>>>> uses in streaming generally [7]. Thus, by improving the streaming
>>>>>>>>>>>> API to allow ML algorithms, the streaming API would benefit too
>>>>>>>>>>>> (which is important, as it has a lot more production users than the
>>>>>>>>>>>> batch API).
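
A rough sketch of what the loops API [5,6] offers: DataStream.iterate feeds
model states back into the loop until an update is small enough. ModelState
and gradientStep are made-up names and the convergence test is deliberately
simplistic:

    import org.apache.flink.streaming.api.scala._

    case class ModelState(weights: Array[Double], delta: Double)

    object IterativeTrainingSketch {
      // Hypothetical update step: shrink the weights and track how much they changed.
      def gradientStep(s: ModelState): ModelState =
        ModelState(s.weights.map(_ * 0.9), s.delta * 0.9)

      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val initial = env.fromElements(ModelState(Array(1.0, -1.0), 1.0))

        // Feed model states back into the loop until the update is small enough.
        val converged = initial.iterate((states: DataStream[ModelState]) => {
          val updated  = states.map(s => gradientStep(s))
          val feedback = updated.filter(_.delta > 1e-3)   // keep iterating
          val output   = updated.filter(_.delta <= 1e-3)  // leaves the loop once converged
          (feedback, output)
        }, 5000) // time out the loop so a finite sketch job can finish

        converged.print()
        env.execute("iterative-training-sketch")
      }
    }
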
>>>>>>>>>>>> 4) Performance can be at least as good
>>>>>>>>>>>> I believe the same performance could be achieved with the streaming
>>>>>>>>>>>> API as with the batch API. The streaming API is much closer to the
>>>>>>>>>>>> runtime than the batch API. For corner cases covered by runtime-layer
>>>>>>>>>>>> optimizations of the batch API, we could find a way to do the same
>>>>>>>>>>>> (or a similar) optimization for the streaming API (see my previous
>>>>>>>>>>>> point). Such a case could be using managed memory (and spilling to
>>>>>>>>>>>> disk). There are also benefits by default, e.g. we would have
>>>>>>>>>>>> finer-grained fault tolerance with the streaming API.
>>>>>>>>>>>>
>>>>>>>>>>>> 5) We could keep the batch ML API
>>>>>>>>>>>> For the shorter term, we should not throw away all the algorithms
>>>>>>>>>>>> implemented with the batch API. By pushing forward the development
>>>>>>>>>>>> of side inputs, we could make them usable with the streaming API.
>>>>>>>>>>>> Then, if the library gains some popularity, we could replace the
>>>>>>>>>>>> batch API algorithms with streaming ones, to avoid the performance
>>>>>>>>>>>> costs of e.g. not being able to persist.
>>>>>>>>>>>>
>>>>>>>>>>>> 6) General tools for implementing ML algorithms
>>>>>>>>>>>> Besides implementing algorithms one by one, we could provide more
>>>>>>>>>>>> general tools for making it easier to implement algorithms, e.g. a
>>>>>>>>>>>> parameter server [8,9]. Theo also mentioned in another thread that
>>>>>>>>>>>> TensorFlow has a similar model to Flink streaming; we could look
>>>>>>>>>>>> into that too. I think that often, when deploying a production ML
>>>>>>>>>>>> system, much more configuration and tweaking should be possible than
>>>>>>>>>>>> e.g. Spark MLlib allows. Why not allow that?
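
For context on the parameter server of [8,9]: it boils down to a keyed
pull/push store for model partitions. A purely illustrative, single-process
sketch (not an existing Flink API; a real one would distribute the state
across operators, e.g. via QueryableState as discussed in [9]):

    // A hypothetical, minimal parameter-server contract: workers pull the current
    // value of a model partition and push updates to it.
    trait ParameterServer[K, P] {
      def pull(key: K): P
      def push(key: K, update: P): Unit
    }

    // Single-process stand-in, useful only to show the contract.
    class LocalParameterServer[K, P](init: K => P, merge: (P, P) => P)
        extends ParameterServer[K, P] {
      private val params = scala.collection.mutable.Map.empty[K, P]
      override def pull(key: K): P = params.getOrElseUpdate(key, init(key))
      override def push(key: K, update: P): Unit =
        params.update(key, merge(pull(key), update))
    }

    object ParameterServerSketch {
      def main(args: Array[String]): Unit = {
        // Dense gradient accumulation per model partition id.
        val ps = new LocalParameterServer[Int, Array[Double]](
          init = _ => Array.fill(4)(0.0),
          merge = (cur, upd) => cur.zip(upd).map { case (c, u) => c + u })
        ps.push(0, Array(0.1, -0.2, 0.0, 0.3))
        println(ps.pull(0).mkString(", "))
      }
    }
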
>>>>>>>>>>>> 7) Showcasing
>>>>>>>>>>>> Showcasing this could be easier. We could say that we're doing batch
>>>>>>>>>>>> ML with a streaming API. That's interesting in its own right. IMHO
>>>>>>>>>>>> this integration is also a more approachable way towards end-to-end
>>>>>>>>>>>> ML.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for reading so far :)
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>>>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
>>>>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
>>>>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>>>>>>>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Gabor
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>> --
>>>>>>> *Yours faithfully, *
>>>>>>> *Kate Eri.*
>>>>>>>
>>>>>>>
>>>>>>>


Re: [DISCUSS] Flink ML roadmap

Posted by Theodore Vasiloudis <th...@gmail.com>.
Sure having a deadline for March 3rd is fine. I can act as coordinator,
trying to guide the discussion to concrete results.

For committers, it's up to their discretion and time whether they want to
participate. I don't think it's necessary to have one, but it would be most
welcome.

@Katherin I would suggest you start a topic on the list about FLINK-1730,
if it takes a lot of development effort from your side it's best to at
least try to gauge the community's interest, and whether there will be
motivation to merge the changes.

Maybe at the end of this we will have a FLIP we can submit; that's probably
the way forward if we want to keep this effort within the project. For a new,
highly experimental project like batch ML on streaming I would actually
favor developing on an independent repo, which can later be merged into
main if there is interest.

Regards.
Theodore

Re: [DISCUSS] Flink ML roadmap

Posted by Gábor Hermann <ma...@gaborhermann.com>.
Okay, let's just aim for around the end of next week, but we can take 
more time to discuss if there's still a lot of ongoing activity. Keep 
the topic hot!

Thanks all for the enthusiasm :)



Re: [DISCUSS] Flink ML roadmap

Posted by Stavros Kontopoulos <st...@gmail.com>.
@Gabor 3rd March is ok for me. But maybe giving it a bit more time, like
a week, may suit more people.
What do you all think?
I will contribute to the doc.

+100 for having a coordinator + committer.

Thank you all for joining the discussion.

Cheers,
Stavros

On Thu, Feb 23, 2017 at 4:48 PM, Gábor Hermann <ma...@gaborhermann.com>
wrote:

> Okay, I've created a skeleton of the design doc for choosing a direction:
> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
> 49h3Ud06MIRhahtJ6dw/edit?usp=sharing
>
> Much of the pros/cons have already been discussed here, so I'll try to put
> there all the arguments mentioned in this thread. Feel free to put there
> more :)
>
> @Stavros: I agree we should take action fast. What about collecting our
> thoughts in the doc by around Tuesday next week (28. February)? Then decide
> on the direction and design a roadmap by around Friday (3. March)? Is that
> feasible, or should it take more time?
>
> I think it will be necessary to have a shepherd, or even better a
> committer, to be involved in at least reviewing and accepting the roadmap.
> It would be best, if a committer coordinated all this.
> @Theodore: Would you like to do the coordination?
>
> Regarding the use-cases: I've seen some abstracts of talks at SF Flink
> Forward [1] that seem promising. There are companies already using Flink
> for ML [2,3,4,5].
>
> [1] http://sf.flink-forward.org/program/sessions/
> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-str
> eaming-vs-micro-batch-for-online-learning/
> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-le
> arning-on-flink/
> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learn
> ing-scenarios-with-flink/
>
> Cheers,
> Gabor
>
>
>
> On 2017-02-23 15:19, Katherin Eri wrote:
>
>> I have asked already some teams for useful cases, but all of them need
>> time
>> to think.
>> During analysis something will finally arise.
>> May be we can ask partners of Flink  for cases? Data Artisans got results
>> of customers survey: [1], ML better support is wanted, so we could ask
>> what
>> exactly is necessary.
>>
>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
>>
>> 23 февр. 2017 г. 4:32 PM пользователь "Stavros Kontopoulos" <
>> st.kontopoulos@gmail.com> написал:
>>
>> +100 for a design doc.
>>>
>>> Could we also set a roadmap after some time-boxed investigation captured
>>> in
>>> that document? We need action.
>>>
>>> Looking forward to work on this (whatever that might be) ;) Also are
>>> there
>>> any data supporting one direction or the other from a customer
>>> perspective?
>>> It would help to make more informed decisions.
>>>
>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <ka...@gmail.com>
>>> wrote:
>>>
>>> Yes, ok.
>>>> let's start some design document, and write down there already mentioned
>>>> ideas about: parameter server, about clipper and others. Would be nice
>>>> if
>>>> we will also map this approaches to cases.
>>>> Will work on it collaboratively on each topic, may be finally we will
>>>>
>>> form
>>>
>>>> some picture, that could be agreed with committers.
>>>> @Gabor, could you please start such shared doc, as you have already
>>>>
>>> several
>>>
>>>> ideas proposed?
>>>>
>>>> Thu, 23 Feb 2017, 15:06 Gábor Hermann <ma...@gaborhermann.com>:
>>>>
>>>> I agree, that it's better to go in one direction first, but I think
>>>>> online and offline with streaming API can go somewhat parallel later.
>>>>>
>>>> We
>>>
>>>> could set a short-term goal, concentrate initially on one direction,
>>>>>
>>>> and
>>>
>>>> showcase that direction (e.g. in a blogpost). But first, we should list
>>>>> the pros/cons in a design doc as a minimum. Then make a decision what
>>>>> direction to go. Would that be feasible?
>>>>>
>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
>>>>>
>>>>> I'm not sure that this is feasible, doing all at the same time could
>>>>>>
>>>>> mean
>>>>
>>>>> doing nothing((((
>>>>>> I'm just afraid, that words: we will work on streaming not on
>>>>>>
>>>>> batching,
>>>
>>>> we
>>>>>
>>>>>> have no commiter's time for this, mean that yes, we started work on
>>>>>> FLINK-1730, but nobody will commit this work in the end, as it
>>>>>>
>>>>> already
>>>
>>>> was
>>>>>
>>>>>> with this ticket.
>>>>>>
>>>>>> 23 февр. 2017 г. 14:26 пользователь "Gábor Hermann" <
>>>>>>
>>>>> mail@gaborhermann.com>
>>>>>
>>>>>> написал:
>>>>>>
>>>>>> @Theodore: Great to hear you think the "batch on streaming" approach
>>>>>>>
>>>>>> is
>>>>
>>>>> possible! Of course, we need to pay attention all the pitfalls
>>>>>>>
>>>>>> there,
>>>
>>>> if we
>>>>>
>>>>>> go that way.
>>>>>>>
>>>>>>> +1 for a design doc!
>>>>>>>
>>>>>>> I would add that it's possible to make efforts in all the three
>>>>>>>
>>>>>> directions
>>>>>
>>>>>> (i.e. batch, online, batch on streaming) at the same time. Although,
>>>>>>>
>>>>>> it
>>>>
>>>>> might be worth to concentrate on one. E.g. it would not be so useful
>>>>>>>
>>>>>> to
>>>>
>>>>> have the same batch algorithms with both the batch API and streaming
>>>>>>>
>>>>>> API.
>>>>>
>>>>>> We can decide later.
>>>>>>>
>>>>>>> The design doc could be partitioned to these 3 directions, and we
>>>>>>>
>>>>>> can
>>>
>>>> collect there the pros/cons too. What do you think?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Gabor
>>>>>>>
>>>>>>>
>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>>>>
>>>>>>> Hello all,
>>>>>>>>
>>>>>>>>
>>>>>>>> @Gabor, we have discussed the idea of using the streaming API to
>>>>>>>>
>>>>>>> write
>>>>
>>>>> all
>>>>>
>>>>>> of our ML algorithms with a couple of people offline,
>>>>>>>> and I think it might be possible and is generally worth a shot. The
>>>>>>>> approach we would take would be close to Vowpal Wabbit, not exactly
>>>>>>>> "online", but rather "fast-batch".
>>>>>>>>
>>>>>>>> There will be problems popping up again, even for very simple algos
>>>>>>>>
>>>>>>> like
>>>>>
>>>>>> on
>>>>>>>> line linear regression with SGD [1], but hopefully fixing those
>>>>>>>>
>>>>>>> will
>>>
>>>> be
>>>>
>>>>> more aligned with the priorities of the community.
>>>>>>>>
>>>>>>>> @Katherin, my understanding is that given the limited resources,
>>>>>>>>
>>>>>>> there
>>>>
>>>>> is
>>>>>
>>>>>> no development effort focused on batch processing right now.
>>>>>>>>
>>>>>>>> So to summarize, it seems like there are people willing to work on
>>>>>>>>
>>>>>>> ML
>>>
>>>> on
>>>>>
>>>>>> Flink, but nobody is sure how to do it.
>>>>>>>> There are many directions we could take (batch, online, batch on
>>>>>>>> streaming), each with its own merits and downsides.
>>>>>>>>
>>>>>>>> If you want we can start a design doc and move the conversation
>>>>>>>>
>>>>>>> there,
>>>>
>>>>> come
>>>>>>>> up with a roadmap and start implementing.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Theodore
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.
>>>>>>>> nabble.com/Understanding-connected-streams-use-without-times
>>>>>>>> tamps-td10241.html
>>>>>>>>
>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
>>>>>>>>
>>>>>>> mail@gaborhermann.com
>>>>
>>>>> wrote:
>>>>>>>>
>>>>>>>> It's great to see so much activity in this discussion :)
>>>>>>>>
>>>>>>>>> I'll try to add my thoughts.
>>>>>>>>>
>>>>>>>>> I think building a developer community (Till's 2. point) can be
>>>>>>>>>
>>>>>>>> slightly
>>>>>
>>>>>> separated from what features we should aim for (1. point) and
>>>>>>>>>
>>>>>>>> showcasing
>>>>>
>>>>>> (3. point). Thanks Till for bringing up the ideas for
>>>>>>>>>
>>>>>>>> restructuring,
>>>
>>>> I'm
>>>>>
>>>>>> sure we'll find a way to make the development process more
>>>>>>>>>
>>>>>>>> dynamic.
>>>
>>>> I'll
>>>>>
>>>>>> try to address the rest here.
>>>>>>>>>
>>>>>>>>> It's hard to choose directions between streaming and batch ML. As
>>>>>>>>>
>>>>>>>> Theo
>>>>
>>>>> has
>>>>>>>>> indicated, not much online ML is used in production, but Flink
>>>>>>>>> concentrates
>>>>>>>>> on streaming, so online ML would be a better fit for Flink.
>>>>>>>>>
>>>>>>>> However,
>>>
>>>> as
>>>>>
>>>>>> most of you argued, there's definite need for batch ML. But batch
>>>>>>>>>
>>>>>>>> ML
>>>
>>>> seems
>>>>>>>>> hard to achieve because there are blocking issues with persisting,
>>>>>>>>> iteration paths etc. So it's no good either way.
>>>>>>>>>
>>>>>>>>> I propose a seemingly crazy solution: what if we developed batch
>>>>>>>>> algorithms also with the streaming API? The batch API would
>>>>>>>>>
>>>>>>>> clearly
>>>
>>>> seem
>>>>>
>>>>>> more suitable for ML algorithms, but there a lot of benefits of
>>>>>>>>>
>>>>>>>> this
>>>
>>>> approach too, so it's clearly worth considering. Flink also has
>>>>>>>>>
>>>>>>>> the
>>>
>>>> high
>>>>>
>>>>>> level vision of "streaming for everything" that would clearly fit
>>>>>>>>>
>>>>>>>> this
>>>>
>>>>> case. What do you all think about this? Do you think this solution
>>>>>>>>>
>>>>>>>> would
>>>>>
>>>>>> be
>>>>>>>>> feasible? I would be happy to make a more elaborate proposal, but
>>>>>>>>>
>>>>>>>> I
>>>
>>>> push
>>>>>
>>>>>> my
>>>>>>>>> main ideas here:
>>>>>>>>>
>>>>>>>>> 1) Simplifying by using one system
>>>>>>>>> It could simplify the work of both the users and the developers.
>>>>>>>>>
>>>>>>>> One
>>>
>>>> could
>>>>>>>>> execute training once, or could execute it periodically e.g. by
>>>>>>>>>
>>>>>>>> using
>>>>
>>>>> windows. Low-latency serving and training could be done in the
>>>>>>>>>
>>>>>>>> same
>>>
>>>> system.
>>>>>>>>> We could implement incremental algorithms, without any side inputs
>>>>>>>>>
>>>>>>>> for
>>>>
>>>>> combining online learning (or predictions) with batch learning. Of
>>>>>>>>> course,
>>>>>>>>> all the logic describing these must be somehow implemented (e.g.
>>>>>>>>> synchronizing predictions with training), but it should be easier
>>>>>>>>>
>>>>>>>> to
>>>
>>>> do
>>>>>
>>>>>> so
>>>>>>>>> in one system, than by combining e.g. the batch and streaming API.
>>>>>>>>>
>>>>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>>>>> Despite these benefits, it could seem harder to implement batch ML
>>>>>>>>>
>>>>>>>> with
>>>>>
>>>>>> the streaming API, but in my opinion it's not. There are more
>>>>>>>>>
>>>>>>>> flexible,
>>>>>
>>>>>> lower-level optimization potentials with the streaming API. Most
>>>>>>>>> distributed ML algorithms use a lower-level model than the batch
>>>>>>>>>
>>>>>>>> API
>>>
>>>> anyway, so sometimes it feels like forcing the algorithm logic
>>>>>>>>>
>>>>>>>> into
>>>
>>>> the
>>>>>
>>>>>> training API and tweaking it. Although we could not use the batch
>>>>>>>>> primitives like join, we would have the E.g. in my experience with
>>>>>>>>> implementing a distributed matrix factorization algorithm [1], I
>>>>>>>>>
>>>>>>>> couldn't
>>>>>
>>>>>> do a simple optimization because of the limitations of the
>>>>>>>>>
>>>>>>>> iteration
>>>
>>>> API
>>>>>
>>>>>> [2]. Even if we pushed all the development effort to make the
>>>>>>>>>
>>>>>>>> batch
>>>
>>>> API
>>>>>
>>>>>> more suitable for ML there would be things we couldn't do. E.g.
>>>>>>>>>
>>>>>>>> there
>>>>
>>>>> are
>>>>>
>>>>>> approaches for updating a model iteratively without locks [3,4]
>>>>>>>>>
>>>>>>>> (i.e.
>>>>
>>>>> somewhat asynchronously), and I don't see a clear way to implement
>>>>>>>>>
>>>>>>>> such
>>>>>
>>>>>> algorithms with the batch API.
>>>>>>>>>
>>>>>>>>> 3) Streaming community (users and devs) benefit
>>>>>>>>> The Flink streaming community in general would also benefit from
>>>>>>>>>
>>>>>>>> this
>>>>
>>>>> direction. There are many features needed in the streaming API for
>>>>>>>>>
>>>>>>>> ML
>>>>
>>>>> to
>>>>>
>>>>>> work, but this is also true for the batch API. One really
>>>>>>>>>
>>>>>>>> important
>>>
>>>> is
>>>>
>>>>> the
>>>>>>>>> loops API (a.k.a. iterative DataStreams) [5]. There has been a lot
>>>>>>>>>
>>>>>>>> of
>>>>
>>>>> effort (mostly from Paris) for making it mature enough [6]. Kate
>>>>>>>>> mentioned
>>>>>>>>> using GPUs, and I'm sure they have uses in streaming generally
>>>>>>>>>
>>>>>>>> [7].
>>>
>>>> Thus,
>>>>>
>>>>>> by improving the streaming API to allow ML algorithms, the
>>>>>>>>>
>>>>>>>> streaming
>>>
>>>> API
>>>>>
>>>>>> benefit too (which is important as they have a lot more production
>>>>>>>>>
>>>>>>>> users
>>>>>
>>>>>> than the batch API).
>>>>>>>>>
>>>>>>>>> 4) Performance can be at least as good
>>>>>>>>> I believe the same performance could be achieved with the
>>>>>>>>>
>>>>>>>> streaming
>>>
>>>> API
>>>>>
>>>>>> as
>>>>>>>>> with the batch API. Streaming API is much closer to the runtime
>>>>>>>>>
>>>>>>>> than
>>>
>>>> the
>>>>>
>>>>>> batch API. For corner-cases, with runtime-layer optimizations of
>>>>>>>>>
>>>>>>>> batch
>>>>
>>>>> API,
>>>>>>>>> we could find a way to do the same (or similar) optimization for
>>>>>>>>>
>>>>>>>> the
>>>
>>>> streaming API (see my previous point). Such case could be using
>>>>>>>>>
>>>>>>>> managed
>>>>>
>>>>>> memory (and spilling to disk). There are also benefits by default,
>>>>>>>>>
>>>>>>>> e.g.
>>>>>
>>>>>> we
>>>>>>>>> would have a finer grained fault tolerance with the streaming API.
>>>>>>>>>
>>>>>>>>> 5) We could keep batch ML API
>>>>>>>>> For the shorter term, we should not throw away all the algorithms
>>>>>>>>> implemented with the batch API. By pushing forward the development
>>>>>>>>>
>>>>>>>> with
>>>>>
>>>>>> side inputs we could make them usable with streaming API. Then, if
>>>>>>>>>
>>>>>>>> the
>>>>
>>>>> library gains some popularity, we could replace the algorithms in
>>>>>>>>>
>>>>>>>> the
>>>>
>>>>> batch
>>>>>>>>> API with streaming ones, to avoid the performance costs of e.g.
>>>>>>>>>
>>>>>>>> not
>>>
>>>> being
>>>>>
>>>>>> able to persist.
>>>>>>>>>
>>>>>>>>> 6) General tools for implementing ML algorithms
>>>>>>>>> Besides implementing algorithms one by one, we could give more
>>>>>>>>>
>>>>>>>> general
>>>>
>>>>> tools for making it easier to implement algorithms. E.g. parameter
>>>>>>>>>
>>>>>>>> server
>>>>>
>>>>>> [8,9]. Theo also mentioned in another thread that TensorFlow has a
>>>>>>>>> similar
>>>>>>>>> model to Flink streaming, we could look into that too. I think
>>>>>>>>>
>>>>>>>> often
>>>
>>>> when
>>>>>
>>>>>> deploying a production ML system, much more configuration and
>>>>>>>>>
>>>>>>>> tweaking
>>>>
>>>>> should be done than e.g. Spark MLlib allows. Why not allow that?
>>>>>>>>>
>>>>>>>>> 7) Showcasing
>>>>>>>>> Showcasing this could be easier. We could say that we're doing
>>>>>>>>>
>>>>>>>> batch
>>>
>>>> ML
>>>>>
>>>>>> with a streaming API. That's interesting in its own. IMHO this
>>>>>>>>> integration
>>>>>>>>> is also a more approachable way towards end-to-end ML.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for reading so far :)
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
>>>>>>>>> 13-final77.pdf
>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
>>>>>>>>> Scoped+Loops+and+Job+Termination
>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.
>>>>>>>>>
>>>>>>>> pdf
>>>
>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
>>>>>>>>> com/Using-QueryableState-inside-Flink-jobs-and-
>>>>>>>>> Parameter-Server-implementation-td15880.html
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Gabor
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>
>>>> *Yours faithfully, *
>>>>
>>>> *Kate Eri.*
>>>>
>>>>
>

Re: [DISCUSS] Flink ML roadmap

Posted by Stephan Ewen <se...@apache.org>.
Hi all!

Sorry for joining this discussion late (I have already missed some of the
deadlines set in this thread).

*Here are some thoughts about what we can do immediately*

  (1) Grow the ML community by adding committers with a dedicated library
       focus. Irrespective of any direction decision, this is a must. I know
       that the PMC is actively working on this, stay tuned for some updates.

  (2) I think a repository split helps to make library committer additions
easier, even if it does not go hand in hand
       with a community split. I believe that we can trust committers that
were appointed mainly for their library work to
       commit directly to the library repository and go through pull
requests in the engine/api/connector repository.
       In some sense we have the same thing already: we trust committers to
only commit when they are confident
       in the touched component and submit a pull request if in doubt.
Having the separate repositories makes this rule
       of thumb even simpler.


*On the Roadmap Discussion*

   - Thanks for the collection and discussion already; these are super nice
thoughts. Kudos!

   - My personal take is that the "model evaluation" over streams will be
happening in any case - there
     is genuine interest in it, and various users have already built it
themselves.

   - The model evaluation as one step of a streaming pipeline (classifying
events), followed by CEP (pattern detection)
     or anomaly detection is a valuable use case on top of what pure model
serving systems usually do.
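
     (For illustration only, a minimal sketch of what such a pipeline could
     look like with the Scala DataStream API; the Event type, the weights and
     the threshold are made-up placeholders, not an existing Flink ML API.)

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

case class Event(id: String, features: Array[Double])
case class ScoredEvent(id: String, score: Double)

// Scores each incoming event with a model that is loaded once per task.
class ModelScorer extends RichMapFunction[Event, ScoredEvent] {
  @transient private var weights: Array[Double] = _

  override def open(parameters: Configuration): Unit = {
    // Placeholder: a real job would deserialize previously trained weights
    // from a file or a model store here.
    weights = Array(0.4, -0.1, 0.7)
  }

  override def map(e: Event): ScoredEvent =
    ScoredEvent(e.id, e.features.zip(weights).map { case (x, w) => x * w }.sum)
}

object ScoringPipeline {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val events = env.fromElements(
      Event("a", Array(1.0, 0.0, 2.0)),
      Event("b", Array(0.5, 1.5, 0.0)))

    val scored = events.map(new ModelScorer)

    // Downstream, the scores could feed CEP patterns or an anomaly detector;
    // here we simply flag events above a threshold.
    scored.filter(_.score > 1.0).print()

    env.execute("model evaluation over a stream (sketch)")
  }
}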

   - An "ML training library" is certainly interesting, if the community
can pull it off. More details below.

   - A question I do not yet have a good intuition on is whether the "model
evaluation" and the training part are so
     different (once a good abstraction for model evaluation has been built)
that little cross coordination is needed,
     or whether there is potential in integrating them.
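
     (Purely as an illustration of one possible integration point, a tiny
     model abstraction that a serving operator and an incremental trainer
     could share; the trait and the trivial linear model below are made up
     for the example, they are not part of FlinkML.)

// Serving only needs predict(); training additionally needs update().
trait Model[In, Out] extends Serializable {
  def predict(input: In): Out
  def update(input: In, label: Out): Model[In, Out]
}

// A deliberately trivial online linear model (one SGD step per update).
case class LinearModel(weights: Array[Double], learningRate: Double)
    extends Model[Array[Double], Double] {

  override def predict(x: Array[Double]): Double =
    x.zip(weights).map { case (xi, wi) => xi * wi }.sum

  override def update(x: Array[Double], label: Double): LinearModel = {
    val err = predict(x) - label
    LinearModel(
      weights.zip(x).map { case (wi, xi) => wi - learningRate * err * xi },
      learningRate)
  }
}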


*Thoughts on the ML training library*

  - There seems, especially now, to be a big trend towards deep learning (is
it just temporary or will this be the future?) and in
     that space, little works without GPU acceleration.

  - It is always easier to do something new than to be the n-th version of
something existing (sorry for the generic truism).
    The latter admittedly gives the "all in one integrated framework"
advantage (which can be a very strong argument indeed),
    but the former attracts completely new communities and can often make
more noise with less effort.

  - The "new" is not required to be "online learning", where Theo has
described well that this does not look like it's taking off.
    It can also be traditional ML re-imagined for "continuous
applications", as "continuous / incremental re-training" or so.
    Even on the "model evaluation side", there is a lot of interesting
stuff as mentioned already, like ensembles, multi-armed bandits, ...
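
     (Again only a sketch under assumptions, not a worked-out design:
     "continuous / incremental re-training" could be as simple as re-fitting
     a model on every window of labeled events and emitting the new
     parameters downstream; LabeledPoint, ModelWeights and the per-window
     "training" below are placeholders.)

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.AllWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class LabeledPoint(features: Array[Double], label: Double)
case class ModelWeights(weights: Array[Double])

// Re-fits a (placeholder) model on each window and emits the new weights.
class Refit extends AllWindowFunction[LabeledPoint, ModelWeights, TimeWindow] {
  override def apply(window: TimeWindow,
                     input: Iterable[LabeledPoint],
                     out: Collector[ModelWeights]): Unit = {
    val points = input.toSeq
    if (points.nonEmpty) {
      val dim = points.head.features.length
      val sums = Array.fill(dim)(0.0)
      for (p <- points; i <- 0 until dim) sums(i) += p.features(i) * p.label
      out.collect(ModelWeights(sums.map(_ / points.size)))
    }
  }
}

object ContinuousRetraining {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val labeled = env.fromElements(
      LabeledPoint(Array(1.0, 0.5), 1.0),
      LabeledPoint(Array(0.2, 0.1), 0.0))

    // Every ten minutes (processing time) a fresh model is emitted; it could
    // then be broadcast to the scoring operators of the serving pipeline.
    val models = labeled
      .timeWindowAll(Time.minutes(10))
      .apply(new Refit)

    models.print()
    env.execute("continuous re-training (sketch)")
  }
}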

  - It may be well worth tapping into the work of an existing library (like
TensorFlow) for an easy fix to some hard problems (pre-existing
    hardware integration, pre-existing optimized linear algebra solvers,
etc.) and thinking about how such use cases would look in
    the context of typical Flink applications.


*A bit of engine background information that may help in the planning:*

  - The DataStream API will in the future also support bounded data
computations explicitly (I say this not as a fact, but as
    a strong believer that this is the right direction).

  - Batch runtime execution has seen less attention recently, but seems to be
getting a bit more community focus, because some organizations
    that contribute a lot want to use the batch side as well. For example,
    the effort on fine-grained recovery will already strengthen batch a lot.


Stephan


On Fri, Mar 10, 2017 at 2:38 PM, Till Rohrmann <tr...@apache.org> wrote:

> Hi Roberto,
>
> jpmml looks quite promising and this could be a first step towards the
> model serving story. Thus, I'm really looking forward to seeing it open
> sourced by you guys :-)
>
> @Katherin, I'm not saying that there is no interest in the community to
> work on batch features. However, there is simply not much capacity left to
> mentor such an effort at the moment. I fear without the mentoring from an
> experienced contributor who has worked on the batch part, it will be
> extremely hard to get such a change into the code base. But this will
> hopefully change in the future.
>
> I think the discussion from this thread moved over to [1] and I will
> continue discussing there.
>
> [1]
> http://apache-flink-mailing-list-archive.1008284.n3.
> nabble.com/Machine-Learning-on-Flink-Next-steps-td16334.html#none
>
> Cheers,
> Till
>
>

Re: [DISCUSS] Flink ML roadmap

Posted by Till Rohrmann <tr...@apache.org>.
Hi Roberto,

jpmml looks quite promising and this could be a first step towards the
model serving story. Thus, I'm really looking forward to seeing it open
sourced by you guys :-)

@Katherin, I'm not saying that there is no interest in the community to
work on batch features. However, there is simply not much capacity left to
mentor such an effort at the moment. I fear without the mentoring from an
experienced contributor who has worked on the batch part, it will be
extremely hard to get such a change into the code base. But this will
hopefully change in the future.

I think the discussion from this thread moved over to [1] and I will
continue discussing there.

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Machine-Learning-on-Flink-Next-steps-td16334.html#none

Cheers,
Till

On Wed, Mar 8, 2017 at 1:59 AM, Kavulya, Soila P <so...@intel.com>
wrote:

> Hi Theodore,
>
> We had put together a proposal for an ML DSL in Apache Beam. We had
> developed a couple of scoring engines as part of TAP https://github.com/
> tapanalyticstoolkit/model-scoring-java and https://github.com/
> tapanalyticstoolkit/scoring-pipelines. However, our group is no longer
> actively developing them.
>
> Thanks,
>
> Soila
>
> From: Theodore Vasiloudis [mailto:theodoros.vasiloudis@gmail.com]
> Sent: Friday, March 3, 2017 4:11 AM
> To: dev@flink.apache.org
> Cc: Kavulya, Soila P <so...@intel.com>
> Subject: Re: [DISCUSS] Flink ML roadmap
>
> It seems like a relatively new project, backed by Intel.
>
> My impression from the doc Roberto linked is that they might switch to
> using Beam instead of Spark (?)
> I'm cc'ing Soila, who is a developer of TAP and has worked on FlinkML in the
> past; perhaps she has some input on how they plan to work with streaming
> and ML in TAP.
>
> Repos:
> [1] https://github.com/tapanalyticstoolkit/
>
> On Fri, Mar 3, 2017 at 12:24 PM, Stavros Kontopoulos <
> st.kontopoulos@gmail.com> wrote:
> Interesting, thanks @Roberto. I see that only TAP Analytics Toolkit
> supports streaming. I am not aware of its market share, anyone?
>
> Best,
> Stavros
>
> On Fri, Mar 3, 2017 at 11:50 AM, Theodore Vasiloudis <
> theodoros.vasiloudis@gmail.com>
> wrote:
>
> > Thank you for the links Roberto I did not know that Beam was working on
> an
> > ML abstraction as well. I'm sure we can learn from that.
> >
> > I'll start another thread today where we can discuss next steps and
> action
> > points now that we have a few different paths to follow listed on the
> > shared doc,
> > since our deadline was today. We welcome further discussions of course.
> >
> > Regards,
> > Theodore
> >
> > On Thu, Mar 2, 2017 at 10:52 AM, Roberto Bentivoglio <
> > roberto.bentivoglio@radicalbit.io> wrote:
> >
> > > Hi All,
> > >
> > > First of all I'd like to introduce myself: my name is Roberto
> Bentivoglio
> > > and I'm currently working for Radicalbit as Andrea Spina (he already
> > wrote
> > > on this thread).
> > > I didn't have the chance to directly contribute on Flink up to now, but
> > > some colleagues of mine are doing that since at least one year (they
> > > contributed also on the machine learning library).
> > >
> > > I hope I'm not jumping into the discussion too late, it's really
> interesting
> > > and the analysis document is depicting really well the scenarios
> > currently
> > > available. Many thanks for your effort!
> > >
> > > If I can add my two cents to the discussion I'd like to add the
> > following:
> > >  - it's clear that currently the Flink community is focused more deeply
> > > on streaming features than on batch features. For this reason I think that
> > > implementing "Offline learning with Streaming API" is really a great idea.
> > >  - I think that the "Online learning" option is really a good fit for
> > > Flink, but maybe we could give at the beginning a higher priority to
> the
> > > "Offline learning with Streaming API" option. However I think that this
> > > option will be the main goal for the mid/long term.
> > >  - we implemented a library based on jpmml-evaluator[1] and flink
> called
> > > "flink-jpmml". Using this library you can train the models on external
> > > systems and use those models, after you've exported in a PMML standard
> > > format, to run evaluations on top of DataStream API. We don't have open
> > > sourced this library up to now, but we're planning to do this in the
> next
> > > weeks. We'd like to complete the documentation and the final code
> reviews
> > > before sharing it. I hope it will be helpful for the community to
> > enhance
> > > the ML support on Flink
> > >  - I'd like also to mention that the Apache Beam community is thinking
> on
> > a
> > > ML DSL. There is a design document and a couple of Jira tasks for that
> > > [2][3]
> > >
> > > We're really keen to focus our effort to improve the ML support on
> Flink
> > in
> > > Radicalbit, we will contribute on this effort for sure on a regular
> basis
> > > with our team.
> > >
> > > Looking forward to work with you!
> > >
> > > Many thanks,
> > > Roberto
> > >
> > > [1] - https://github.com/jpmml/jpmml-evaluator
> > > [2] -
> > > https://docs.google.com/document/d/17cRZk_
> yqHm3C0fljivjN66MbLkeKS1yjo4PB
> > > ECHb-xA
> > > [3] - https://issues.apache.org/jira/browse/BEAM-303
> > >
> > > On 28 February 2017 at 19:35, Gábor Hermann <mail@gaborhermann.com>
> > wrote:
> > >
> > > > Hi Philipp,
> > > >
> > > > It's great to hear you are interested in Flink ML!
> > > >
> > > > Based on your description, your prototype seems like an interesting
> > > > approach for combining online+offline learning. If you're interested,
> > we
> > > > might find a way to integrate your work, or at least your ideas, into
> > > Flink
> > > > ML if we decide on a direction that fits your approach. I think your
> > work
> > > > could be relevant for almost all the directions listed there (if I
> > > > understand correctly you'd even like to serve predictions on
> unlabeled
> > > > data).
> > > >
> > > > Feel free to join the discussion in the docs you've mentioned :)
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > >
> > > > On 2017-02-27 18:39, Philipp Zehnder wrote:
> > > >
> > > > Hello all,
> > > >>
> > > >> I’m new to this mailing list and I wanted to introduce myself. My
> name
> > > is
> > > >> Philipp Zehnder and I’m a Masters Student in Computer Science at the
> > > >> Karlsruhe Institute of Technology in Germany currently writing on my
> > > >> master’s thesis with the main goal to integrate reusable machine
> > > learning
> > > >> components into a stream processing network. One part of my thesis
> is
> > to
> > > >> create an API for distributed online machine learning.
> > > >>
> > > >> I saw that there are some recent discussions how to continue the
> > > >> development of Flink ML [1] and I want to share some of my
> experiences
> > > and
> > > >> maybe get some feedback from the community for my ideas.
> > > >>
> > > >> As I am new to open source projects I hope this is the right place
> for
> > > >> this.
> > > >>
> > > >> In the beginning, I had a look at different already existing
> > frameworks
> > > >> like Apache SAMOA for example, which is great and has a lot of
> useful
> > > >> resources. However, as Flink is currently focusing on streaming,
> from
> > my
> > > >> point of view it makes sense to also have a streaming machine
> learning
> > > API
> > > >> as part of the Flink ecosystem.
> > > >>
> > > >> I’m currently working on building a prototype for a distributed
> > > streaming
> > > >> machine learning library based on Flink that can be used for online
> > and
> > > >> “classical” offline learning.
> > > >>
> > > >> The machine learning algorithm takes labeled and non-labeled data.
> On
> > a
> > > >> labeled data point first a prediction is performed and then this
> label
> > > is
> > > >> used to train the model. On a non-labeled data point just a
> prediction
> > > is
> > > >> performed. The main difference between the online and offline
> > > algorithms is
> > > >> that in the offline case the labeled data must be handed to the
> model
> > > >> before the unlabeled data. In the online case, it is still possible
> to
> > > >> process labeled data at a later point to update the model. The
> > > advantage of
> > > >> this approach is that batch algorithms can be applied on streaming
> > data
> > > as
> > > >> well as online algorithms can be supported.
> > > >>
> > > >> One difference to batch learning are the transformers that are used
> to
> > > >> preprocess the data. For example, a simple mean subtraction must be
> > > >> implemented with a rolling mean, because we can’t calculate the mean
> > > over
> > > >> all the data, but the Flink Streaming API is perfect for that. It
> > would
> > > be
> > > >> useful for users to have an extensible toolbox of transformers.
> > > >>
> > > >> Another difference is the evaluation of the models. As we don’t
> have a
> > > >> single value to determine the model quality, in streaming scenarios
> > this
> > > >> value evolves over time when it sees more labeled data.
> > > >>
> > > >> However, the transformation and evaluation works again similar in
> both
> > > >> online learning and offline learning.
> > > >>
> > > >> I also liked the discussion in [2] and I think that the competition
> in
> > > >> the batch learning field is hard and there are already a lot of
> great
> > > >> projects. I think it is true that in most real world problems it is
> > not
> > > >> necessary to update the model immediately, but there are a lot of
> use
> > > cases
> > > >> for machine learning on streams. For them it would be nice to have a
> > > native
> > > >> approach.
> > > >>
> > > >> A stream machine learning API for Flink would fit very well and I
> > would
> > > >> also be willing to contribute to the future development of the Flink
> > ML
> > > >> library.
> > > >>
> > > >>
> > > >>
> > > >> Best regards,
> > > >>
> > > >> Philipp
> > > >>
> > > >> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > > >> com/DISCUSS-Flink-ML-roadmap-td16040.html <
> > > http://apache-flink-mailing-l
> > > >> ist-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-<
> http://ist-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap->
> > td16040.html
> > > >
> > > >> [2] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
> > > >> 49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2 <
> > > >> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQ
> > > >> c49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2>
> > > >>
> > > >>
> > > >> On 23.02.2017 at 15:48, Gábor Hermann <mail@gaborhermann.com> wrote:
> > > >>>
> > > >>> Okay, I've created a skeleton of the design doc for choosing a
> > > direction:
> > > >>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
> > > >>> 49h3Ud06MIRhahtJ6dw/edit?usp=sharing
> > > >>>
> > > >>> Much of the pros/cons have already been discussed here, so I'll try
> > to
> > > >>> put there all the arguments mentioned in this thread. Feel free to
> > put
> > > >>> there more :)
> > > >>>
> > > >>> @Stavros: I agree we should take action fast. What about collecting
> > our
> > > >>> thoughts in the doc by around Tuesday next week (28. February)?
> Then
> > > decide
> > > >>> on the direction and design a roadmap by around Friday (3. March)?
> Is
> > > that
> > > >>> feasible, or should it take more time?
> > > >>>
> > > >>> I think it will be necessary to have a shepherd, or even better a
> > > >>> committer, to be involved in at least reviewing and accepting the
> > > roadmap.
> > > >>> It would be best, if a committer coordinated all this.
> > > >>> @Theodore: Would you like to do the coordination?
> > > >>>
> > > >>> Regarding the use-cases: I've seen some abstracts of talks at SF
> > Flink
> > > >>> Forward [1] that seem promising. There are companies already using
> > > Flink
> > > >>> for ML [2,3,4,5].
> > > >>>
> > > >>> [1] http://sf.flink-forward.org/program/sessions/
> > > >>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-str
> > > >>> eaming-vs-micro-batch-for-online-learning/
> > > >>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-te
> > > >>> nsorflow/
> > > >>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-le
> > > >>> arning-on-flink/
> > > >>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learn
> > > >>> ing-scenarios-with-flink/
> > > >>>
> > > >>> Cheers,
> > > >>> Gabor
> > > >>>
> > > >>>
> > > >>> On 2017-02-23 15:19, Katherin Eri wrote:
> > > >>>
> > > >>>> I have asked already some teams for useful cases, but all of them
> > need
> > > >>>> time
> > > >>>> to think.
> > > >>>> During analysis something will finally arise.
> > > >>>> May be we can ask partners of Flink  for cases? Data Artisans got
> > > >>>> results
> > > >>>> of customers survey: [1], ML better support is wanted, so we could
> > ask
> > > >>>> what
> > > >>>> exactly is necessary.
> > > >>>>
> > > >>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
> > > >>>>
> > > >>>> On 23 Feb 2017 at 4:32 PM, "Stavros Kontopoulos" <
> > > >>>> st.kontopoulos@gmail.com>
> wrote:
> > > >>>>
> > > >>>> +100 for a design doc.
> > > >>>>>
> > > >>>>> Could we also set a roadmap after some time-boxed investigation
> > > >>>>> captured in
> > > >>>>> that document? We need action.
> > > >>>>>
> > > >>>>> Looking forward to work on this (whatever that might be) ;) Also
> > are
> > > >>>>> there
> > > >>>>> any data supporting one direction or the other from a customer
> > > >>>>> perspective?
> > > >>>>> It would help to make more informed decisions.
> > > >>>>>
> > > >>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <
> > > katherinmail@gmail.com>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>> Yes, ok.
> > > >>>>>> let's start some design document, and write down there already
> > > >>>>>> mentioned
> > > >>>>>> ideas about: parameter server, about clipper and others. Would
> be
> > > >>>>>> nice if
> > > >>>>>> we will also map this approaches to cases.
> > > >>>>>> Will work on it collaboratively on each topic, may be finally we
> > > will
> > > >>>>>>
> > > >>>>> form
> > > >>>>>
> > > >>>>>> some picture, that could be agreed with committers.
> > > >>>>>> @Gabor, could you please start such shared doc, as you have
> > already
> > > >>>>>>
> > > >>>>> several
> > > >>>>>
> > > >>>>>> ideas proposed?
> > > >>>>>>
> > > >>>>>> Thu, 23 Feb 2017, 15:06 Gábor Hermann <mail@gaborhermann.com>:
> > > >>>>>>
> > > >>>>>> I agree, that it's better to go in one direction first, but I
> > think
> > > >>>>>>> online and offline with streaming API can go somewhat parallel
> > > later.
> > > >>>>>>>
> > > >>>>>> We
> > > >>>>>
> > > >>>>>> could set a short-term goal, concentrate initially on one
> > direction,
> > > >>>>>>>
> > > >>>>>> and
> > > >>>>>
> > > >>>>>> showcase that direction (e.g. in a blogpost). But first, we
> should
> > > >>>>>>> list
> > > >>>>>>> the pros/cons in a design doc as a minimum. Then make a
> decision
> > > what
> > > >>>>>>> direction to go. Would that be feasible?
> > > >>>>>>>
> > > >>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
> > > >>>>>>>
> > > >>>>>>> I'm not sure that this is feasible, doing all at the same time
> > > could
> > > >>>>>>>>
> > > >>>>>>> mean
> > > >>>>>>
> > > >>>>>>> doing nothing((((
> > > >>>>>>>> I'm just afraid, that words: we will work on streaming not on
> > > >>>>>>>>
> > > >>>>>>> batching,
> > > >>>>>
> > > >>>>>> we
> > > >>>>>>>
> > > >>>>>>>> have no commiter's time for this, mean that yes, we started
> work
> > > on
> > > >>>>>>>> FLINK-1730, but nobody will commit this work in the end, as it
> > > >>>>>>>>
> > > >>>>>>> already
> > > >>>>>
> > > >>>>>> was
> > > >>>>>>>
> > > >>>>>>>> with this ticket.
> > > >>>>>>>>
> > > >>>>>>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <
> > > >>>>>>>>
> > > >>>>>>> mail@gaborhermann.com>
> > > >>>>>>>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>> @Theodore: Great to hear you think the "batch on streaming"
> > > approach
> > > >>>>>>>>>
> > > >>>>>>>> is
> > > >>>>>>
> > > >>>>>>> possible! Of course, we need to pay attention all the pitfalls
> > > >>>>>>>>>
> > > >>>>>>>> there,
> > > >>>>>
> > > >>>>>> if we
> > > >>>>>>>
> > > >>>>>>>> go that way.
> > > >>>>>>>>>
> > > >>>>>>>>> +1 for a design doc!
> > > >>>>>>>>>
> > > >>>>>>>>> I would add that it's possible to make efforts in all the
> three
> > > >>>>>>>>>
> > > >>>>>>>> directions
> > > >>>>>>>
> > > >>>>>>>> (i.e. batch, online, batch on streaming) at the same time.
> > > Although,
> > > >>>>>>>>>
> > > >>>>>>>> it
> > > >>>>>>
> > > >>>>>>> might be worth to concentrate on one. E.g. it would not be so
> > > useful
> > > >>>>>>>>>
> > > >>>>>>>> to
> > > >>>>>>
> > > >>>>>>> have the same batch algorithms with both the batch API and
> > > streaming
> > > >>>>>>>>>
> > > >>>>>>>> API.
> > > >>>>>>>
> > > >>>>>>>> We can decide later.
> > > >>>>>>>>>
> > > >>>>>>>>> The design doc could be partitioned to these 3 directions,
> and
> > we
> > > >>>>>>>>>
> > > >>>>>>>> can
> > > >>>>>
> > > >>>>>> collect there the pros/cons too. What do you think?
> > > >>>>>>>>>
> > > >>>>>>>>> Cheers,
> > > >>>>>>>>> Gabor
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Hello all,
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> @Gabor, we have discussed the idea of using the streaming
> API
> > to
> > > >>>>>>>>>>
> > > >>>>>>>>> write
> > > >>>>>>
> > > >>>>>>> all
> > > >>>>>>>
> > > >>>>>>>> of our ML algorithms with a couple of people offline,
> > > >>>>>>>>>> and I think it might be possible and is generally worth a
> > shot.
> > > >>>>>>>>>> The
> > > >>>>>>>>>> approach we would take would be close to Vowpal Wabbit, not
> > > >>>>>>>>>> exactly
> > > >>>>>>>>>> "online", but rather "fast-batch".
> > > >>>>>>>>>>
> > > >>>>>>>>>> There will be problems popping up again, even for very
> simple
> > > >>>>>>>>>> algos
> > > >>>>>>>>>>
> > > >>>>>>>>> like
> > > >>>>>>>
> > > >>>>>>>> on
> > > >>>>>>>>>> line linear regression with SGD [1], but hopefully fixing
> > those
> > > >>>>>>>>>>
> > > >>>>>>>>> will
> > > >>>>>
> > > >>>>>> be
> > > >>>>>>
> > > >>>>>>> more aligned with the priorities of the community.
> > > >>>>>>>>>>
> > > >>>>>>>>>> @Katherin, my understanding is that given the limited
> > resources,
> > > >>>>>>>>>>
> > > >>>>>>>>> there
> > > >>>>>>
> > > >>>>>>> is
> > > >>>>>>>
> > > >>>>>>>> no development effort focused on batch processing right now.
> > > >>>>>>>>>>
> > > >>>>>>>>>> So to summarize, it seems like there are people willing to
> > work
> > > on
> > > >>>>>>>>>>
> > > >>>>>>>>> ML
> > > >>>>>
> > > >>>>>> on
> > > >>>>>>>
> > > >>>>>>>> Flink, but nobody is sure how to do it.
> > > >>>>>>>>>> There are many directions we could take (batch, online,
> batch
> > on
> > > >>>>>>>>>> streaming), each with its own merits and downsides.
> > > >>>>>>>>>>
> > > >>>>>>>>>> If you want we can start a design doc and move the
> > conversation
> > > >>>>>>>>>>
> > > >>>>>>>>> there,
> > > >>>>>>
> > > >>>>>>> come
> > > >>>>>>>>>> up with a roadmap and start implementing.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Regards,
> > > >>>>>>>>>> Theodore
> > > >>>>>>>>>>
> > > >>>>>>>>>> [1]
> > > >>>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.
> > > >>>>>>>>>> nabble.com/Understanding-connected-streams-use-without-
> times<http://nabble.com/Understanding-connected-streams-use-without-times>
> > > >>>>>>>>>> tamps-td10241.html
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
> > > >>>>>>>>>>
> > > >>>>>>>>> mail@gaborhermann.com<ma...@gaborhermann.com>
> > > >>>>>>
> > > >>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> It's great to see so much activity in this discussion :)
> > > >>>>>>>>>>
> > > >>>>>>>>>>> I'll try to add my thoughts.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I think building a developer community (Till's 2. point)
> can
> > be
> > > >>>>>>>>>>>
> > > >>>>>>>>>> slightly
> > > >>>>>>>
> > > >>>>>>>> separated from what features we should aim for (1. point) and
> > > >>>>>>>>>>>
> > > >>>>>>>>>> showcasing
> > > >>>>>>>
> > > >>>>>>>> (3. point). Thanks Till for bringing up the ideas for
> > > >>>>>>>>>>>
> > > >>>>>>>>>> restructuring,
> > > >>>>>
> > > >>>>>> I'm
> > > >>>>>>>
> > > >>>>>>>> sure we'll find a way to make the development process more
> > > >>>>>>>>>>>
> > > >>>>>>>>>> dynamic.
> > > >>>>>
> > > >>>>>> I'll
> > > >>>>>>>
> > > >>>>>>>> try to address the rest here.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> It's hard to choose directions between streaming and batch
> > ML.
> > > As
> > > >>>>>>>>>>>
> > > >>>>>>>>>> Theo
> > > >>>>>>
> > > >>>>>>> has
> > > >>>>>>>>>>> indicated, not much online ML is used in production, but
> > Flink
> > > >>>>>>>>>>> concentrates
> > > >>>>>>>>>>> on streaming, so online ML would be a better fit for Flink.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> However,
> > > >>>>>
> > > >>>>>> as
> > > >>>>>>>
> > > >>>>>>>> most of you argued, there's definite need for batch ML. But
> > batch
> > > >>>>>>>>>>>
> > > >>>>>>>>>> ML
> > > >>>>>
> > > >>>>>> seems
> > > >>>>>>>>>>> hard to achieve because there are blocking issues with
> > > >>>>>>>>>>> persisting,
> > > >>>>>>>>>>> iteration paths etc. So it's no good either way.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I propose a seemingly crazy solution: what if we developed
> > > batch
> > > >>>>>>>>>>> algorithms also with the streaming API? The batch API would
> > > >>>>>>>>>>>
> > > >>>>>>>>>> clearly
> > > >>>>>
> > > >>>>>> seem
> > > >>>>>>>
> > > >>>>>>>> more suitable for ML algorithms, but there a lot of benefits
> of
> > > >>>>>>>>>>>
> > > >>>>>>>>>> this
> > > >>>>>
> > > >>>>>> approach too, so it's clearly worth considering. Flink also has
> > > >>>>>>>>>>>
> > > >>>>>>>>>> the
> > > >>>>>
> > > >>>>>> high
> > > >>>>>>>
> > > >>>>>>>> level vision of "streaming for everything" that would clearly
> > fit
> > > >>>>>>>>>>>
> > > >>>>>>>>>> this
> > > >>>>>>
> > > >>>>>>> case. What do you all think about this? Do you think this
> > solution
> > > >>>>>>>>>>>
> > > >>>>>>>>>> would
> > > >>>>>>>
> > > >>>>>>>> be
> > > >>>>>>>>>>> feasible? I would be happy to make a more elaborate
> proposal,
> > > but
> > > >>>>>>>>>>>
> > > >>>>>>>>>> I
> > > >>>>>
> > > >>>>>> push
> > > >>>>>>>
> > > >>>>>>>> my
> > > >>>>>>>>>>> main ideas here:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 1) Simplifying by using one system
> > > >>>>>>>>>>> It could simplify the work of both the users and the
> > > developers.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> One
> > > >>>>>
> > > >>>>>> could
> > > >>>>>>>>>>> execute training once, or could execute it periodically
> e.g.
> > by
> > > >>>>>>>>>>>
> > > >>>>>>>>>> using
> > > >>>>>>
> > > >>>>>>> windows. Low-latency serving and training could be done in the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> same
> > > >>>>>
> > > >>>>>> system.
> > > >>>>>>>>>>> We could implement incremental algorithms, without any side
> > > >>>>>>>>>>> inputs
> > > >>>>>>>>>>>
> > > >>>>>>>>>> for
> > > >>>>>>
> > > >>>>>>> combining online learning (or predictions) with batch learning.
> > Of
> > > >>>>>>>>>>> course,
> > > >>>>>>>>>>> all the logic describing these must be somehow implemented
> > > (e.g.
> > > >>>>>>>>>>> synchronizing predictions with training), but it should be
> > > easier
> > > >>>>>>>>>>>
> > > >>>>>>>>>> to
> > > >>>>>
> > > >>>>>> do
> > > >>>>>>>
> > > >>>>>>>> so
> > > >>>>>>>>>>> in one system, than by combining e.g. the batch and
> streaming
> > > >>>>>>>>>>> API.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 2) Batch ML with the streaming API is not harder
> > > >>>>>>>>>>> Despite these benefits, it could seem harder to implement
> > batch
> > > >>>>>>>>>>> ML
> > > >>>>>>>>>>>
> > > >>>>>>>>>> with
> > > >>>>>>>
> > > >>>>>>>> the streaming API, but in my opinion it's not. There are more
> > > >>>>>>>>>>>
> > > >>>>>>>>>> flexible,
> > > >>>>>>>
> > > >>>>>>>> lower-level optimization potentials with the streaming API.
> Most
> > > >>>>>>>>>>> distributed ML algorithms use a lower-level model than the
> > > batch
> > > >>>>>>>>>>>
> > > >>>>>>>>>> API
> > > >>>>>
> > > >>>>>> anyway, so sometimes it feels like forcing the algorithm logic
> > > >>>>>>>>>>>
> > > >>>>>>>>>> into
> > > >>>>>
> > > >>>>>> the
> > > >>>>>>>
> > > >>>>>>>> training API and tweaking it. Although we could not use the
> > batch
> > > >>>>>>>>>>> primitives like join, we would have the E.g. in my
> experience
> > > >>>>>>>>>>> with
> > > >>>>>>>>>>> implementing a distributed matrix factorization algorithm
> > [1],
> > > I
> > > >>>>>>>>>>>
> > > >>>>>>>>>> couldn't
> > > >>>>>>>
> > > >>>>>>>> do a simple optimization because of the limitations of the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> iteration
> > > >>>>>
> > > >>>>>> API
> > > >>>>>>>
> > > >>>>>>>> [2]. Even if we pushed all the development effort to make the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> batch
> > > >>>>>
> > > >>>>>> API
> > > >>>>>>>
> > > >>>>>>>> more suitable for ML there would be things we couldn't do.
> E.g.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> there
> > > >>>>>>
> > > >>>>>>> are
> > > >>>>>>>
> > > >>>>>>>> approaches for updating a model iteratively without locks
> [3,4]
> > > >>>>>>>>>>>
> > > >>>>>>>>>> (i.e.
> > > >>>>>>
> > > >>>>>>> somewhat asynchronously), and I don't see a clear way to
> > implement
> > > >>>>>>>>>>>
> > > >>>>>>>>>> such
> > > >>>>>>>
> > > >>>>>>>> algorithms with the batch API.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 3) Streaming community (users and devs) benefit
> > > >>>>>>>>>>> The Flink streaming community in general would also benefit
> > > from
> > > >>>>>>>>>>>
> > > >>>>>>>>>> this
> > > >>>>>>
> > > >>>>>>> direction. There are many features needed in the streaming API
> > for
> > > >>>>>>>>>>>
> > > >>>>>>>>>> ML
> > > >>>>>>
> > > >>>>>>> to
> > > >>>>>>>
> > > >>>>>>>> work, but this is also true for the batch API. One really
> > > >>>>>>>>>>>
> > > >>>>>>>>>> important
> > > >>>>>
> > > >>>>>> is
> > > >>>>>>
> > > >>>>>>> the
> > > >>>>>>>>>>> loops API (a.k.a. iterative DataStreams) [5]. There has
> been
> > a
> > > >>>>>>>>>>> lot
> > > >>>>>>>>>>>
> > > >>>>>>>>>> of
> > > >>>>>>
> > > >>>>>>> effort (mostly from Paris) for making it mature enough [6].
> Kate
> > > >>>>>>>>>>> mentioned
> > > >>>>>>>>>>> using GPUs, and I'm sure they have uses in streaming
> > generally
> > > >>>>>>>>>>>
> > > >>>>>>>>>> [7].
> > > >>>>>
> > > >>>>>> Thus,
> > > >>>>>>>
> > > >>>>>>>> by improving the streaming API to allow ML algorithms, the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> streaming
> > > >>>>>
> > > >>>>>> API
> > > >>>>>>>
> > > >>>>>>>> benefit too (which is important as they have a lot more
> > production
> > > >>>>>>>>>>>
> > > >>>>>>>>>> users
> > > >>>>>>>
> > > >>>>>>>> than the batch API).
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 4) Performance can be at least as good
> > > >>>>>>>>>>> I believe the same performance could be achieved with the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> streaming
> > > >>>>>
> > > >>>>>> API
> > > >>>>>>>
> > > >>>>>>>> as
> > > >>>>>>>>>>> with the batch API. Streaming API is much closer to the
> > runtime
> > > >>>>>>>>>>>
> > > >>>>>>>>>> than
> > > >>>>>
> > > >>>>>> the
> > > >>>>>>>
> > > >>>>>>>> batch API. For corner-cases, with runtime-layer optimizations
> of
> > > >>>>>>>>>>>
> > > >>>>>>>>>> batch
> > > >>>>>>
> > > >>>>>>> API,
> > > >>>>>>>>>>> we could find a way to do the same (or similar)
> optimization
> > > for
> > > >>>>>>>>>>>
> > > >>>>>>>>>> the
> > > >>>>>
> > > >>>>>> streaming API (see my previous point). Such case could be using
> > > >>>>>>>>>>>
> > > >>>>>>>>>> managed
> > > >>>>>>>
> > > >>>>>>>> memory (and spilling to disk). There are also benefits by
> > default,
> > > >>>>>>>>>>>
> > > >>>>>>>>>> e.g.
> > > >>>>>>>
> > > >>>>>>>> we
> > > >>>>>>>>>>> would have a finer grained fault tolerance with the
> streaming
> > > >>>>>>>>>>> API.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 5) We could keep batch ML API
> > > >>>>>>>>>>> For the shorter term, we should not throw away all the
> > > algorithms
> > > >>>>>>>>>>> implemented with the batch API. By pushing forward the
> > > >>>>>>>>>>> development
> > > >>>>>>>>>>>
> > > >>>>>>>>>> with
> > > >>>>>>>
> > > >>>>>>>> side inputs we could make them usable with streaming API.
> Then,
> > if
> > > >>>>>>>>>>>
> > > >>>>>>>>>> the
> > > >>>>>>
> > > >>>>>>> library gains some popularity, we could replace the algorithms
> in
> > > >>>>>>>>>>>
> > > >>>>>>>>>> the
> > > >>>>>>
> > > >>>>>>> batch
> > > >>>>>>>>>>> API with streaming ones, to avoid the performance costs of
> > e.g.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> not
> > > >>>>>
> > > >>>>>> being
> > > >>>>>>>
> > > >>>>>>>> able to persist.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 6) General tools for implementing ML algorithms
> > > >>>>>>>>>>> Besides implementing algorithms one by one, we could give
> > more
> > > >>>>>>>>>>>
> > > >>>>>>>>>> general
> > > >>>>>>
> > > >>>>>>> tools for making it easier to implement algorithms. E.g.
> > parameter
> > > >>>>>>>>>>>
> > > >>>>>>>>>> server
> > > >>>>>>>
> > > >>>>>>>> [8,9]. Theo also mentioned in another thread that TensorFlow
> > has a
> > > >>>>>>>>>>> similar
> > > >>>>>>>>>>> model to Flink streaming, we could look into that too. I
> > think
> > > >>>>>>>>>>>
> > > >>>>>>>>>> often
> > > >>>>>
> > > >>>>>> when
> > > >>>>>>>
> > > >>>>>>>> deploying a production ML system, much more configuration and
> > > >>>>>>>>>>>
> > > >>>>>>>>>> tweaking
> > > >>>>>>
> > > >>>>>>> should be done than e.g. Spark MLlib allows. Why not allow
> that?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 7) Showcasing
> > > >>>>>>>>>>> Showcasing this could be easier. We could say that we're
> > doing
> > > >>>>>>>>>>>
> > > >>>>>>>>>> batch
> > > >>>>>
> > > >>>>>> ML
> > > >>>>>>>
> > > >>>>>>>> with a streaming API. That's interesting in its own. IMHO this
> > > >>>>>>>>>>> integration
> > > >>>>>>>>>>> is also a more approachable way towards end-to-end ML.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks for reading so far :)
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
> > > >>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
> > > >>>>>>>>>>> [3] https://people.eecs.berkeley.
> > edu/~brecht/papers/hogwildTR.
> > > pd
> > > >>>>>>>>>>> f
> > > >>>>>>>>>>> [4] https://www.usenix.org/system/
> > > files/conference/hotos13/hotos
> > > >>>>>>>>>>> 13-final77.pdf
> > > >>>>>>>>>>> [5] https://cwiki.apache.org/
> confluence/display/FLINK/FLIP-
> > 15+
> > > >>>>>>>>>>> Scoped+Loops+and+Job+Termination
> > > >>>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
> > > >>>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-
> > > sigmod16.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> pdf
> > > >>>>>
> > > >>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.
> pdf
> > > >>>>>>>>>>> [9] http://apache-flink-mailing-
> > list-archive.1008284.n3.nabble
> > > .
> > > >>>>>>>>>>> com/Using-QueryableState-inside-Flink-jobs-and-
> > > >>>>>>>>>>> Parameter-Server-implementation-td15880.html
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Cheers,
> > > >>>>>>>>>>> Gabor
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> --
> > > >>>>>>>
> > > >>>>>> *Yours faithfully, *
> > > >>>>>>
> > > >>>>>> *Kate Eri.*
> > > >>>>>>
> > > >>>>>>
> > > >>
> > > >
> > >
> > >
> > > --
> > > Roberto Bentivoglio
> > > CTO
> > > e. roberto.bentivoglio@radicalbit.io
> > > Radicalbit S.r.l.
> > > radicalbit.io
> > >
> >
>
>

RE: [DISCUSS] Flink ML roadmap

Posted by "Kavulya, Soila P" <so...@intel.com>.
Hi Theodore,

We had put together a proposal for an ML DSL in Apache Beam. We had developed a couple of scoring engines as part of TAP https://github.com/tapanalyticstoolkit/model-scoring-java and https://github.com/tapanalyticstoolkit/scoring-pipelines. However, our group is no longer actively developing them.

Thanks,

Soila

From: Theodore Vasiloudis [mailto:theodoros.vasiloudis@gmail.com]
Sent: Friday, March 3, 2017 4:11 AM
To: dev@flink.apache.org
Cc: Kavulya, Soila P <so...@intel.com>
Subject: Re: [DISCUSS] Flink ML roadmap

It seems like a relatively new project, backed by Intel.

My impression from the doc Roberto linked is that they might switch to using Beam instead of Spark (?)
I'm cc'ing Soila, who is a developer of TAP and has worked on FlinkML in the past; perhaps she has some input on how they plan to work with streaming and ML in TAP.

Repos:
[1] https://github.com/tapanalyticstoolkit/

On Fri, Mar 3, 2017 at 12:24 PM, Stavros Kontopoulos <st...@gmail.com> wrote:
Interesting, thanks @Roberto. I see that only TAP Analytics Toolkit
supports streaming. I am not aware of its market share, anyone?

Best,
Stavros

On Fri, Mar 3, 2017 at 11:50 AM, Theodore Vasiloudis <
theodoros.vasiloudis@gmail.com> wrote:

> Thank you for the links Roberto I did not know that Beam was working on an
> ML abstraction as well. I'm sure we can learn from that.
>
> I'll start another thread today where we can discuss next steps and action
> points now that we have a few different paths to follow listed on the
> shared doc,
> since our deadline was today. We welcome further discussions of course.
>
> Regards,
> Theodore
>
> On Thu, Mar 2, 2017 at 10:52 AM, Roberto Bentivoglio <
> roberto.bentivoglio@radicalbit.io> wrote:
>
> > Hi All,
> >
> > First of all I'd like to introduce myself: my name is Roberto Bentivoglio
> > and I'm currently working for Radicalbit as Andrea Spina (he already
> wrote
> > on this thread).
> > I didn't have the chance to directly contribute on Flink up to now, but
> > some colleagues of mine are doing that since at least one year (they
> > contributed also on the machine learning library).
> >
> > I hope I'm not jumping into the discussion too late, it's really interesting
> > and the analysis document is depicting really well the scenarios
> currently
> > available. Many thanks for your effort!
> >
> > If I can add my two cents to the discussion I'd like to add the
> following:
> >  - it's clear that currently the Flink community is focused more deeply on
> > streaming features than on batch features. For this reason I think that
> > implementing "Offline learning with Streaming API" is really a great idea.
> >  - I think that the "Online learning" option is really a good fit for
> > Flink, but maybe we could give at the beginning a higher priority to the
> > "Offline learning with Streaming API" option. However I think that this
> > option will be the main goal for the mid/long term.
> >  - we implemented a library based on jpmml-evaluator[1] and flink called
> > "flink-jpmml". Using this library you can train the models on external
> > systems and use those models, after you've exported in a PMML standard
> > format, to run evaluations on top of DataStream API. We don't have open
> > sourced this library up to now, but we're planning to do this in the next
> > weeks. We'd like to complete the documentation and the final code reviews
> > before sharing it. I hope it will be helpful for the community to
> enhance
> > the ML support on Flink
> >  - I'd like also to mention that the Apache Beam community is thinking on
> a
> > ML DSL. There is a design document and a couple of Jira tasks for that
> > [2][3]
> >
> > We're really keen to focus our effort to improve the ML support on Flink
> in
> > Radicalbit, we will contribute on this effort for sure on a regular basis
> > with our team.
> >
> > Looking forward to work with you!
> >
> > Many thanks,
> > Roberto
> >
> > [1] - https://github.com/jpmml/jpmml-evaluator
> > [2] -
> > https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PB
> > ECHb-xA
> > [3] - https://issues.apache.org/jira/browse/BEAM-303
> >
> > On 28 February 2017 at 19:35, Gábor Hermann <ma...@gaborhermann.com>
> wrote:
> >
> > > Hi Philipp,
> > >
> > > It's great to hear you are interested in Flink ML!
> > >
> > > Based on your description, your prototype seems like an interesting
> > > approach for combining online+offline learning. If you're interested,
> we
> > > might find a way to integrate your work, or at least your ideas, into
> > Flink
> > > ML if we decide on a direction that fits your approach. I think your
> work
> > > could be relevant for almost all the directions listed there (if I
> > > understand correctly you'd even like to serve predictions on unlabeled
> > > data).
> > >
> > > Feel free to join the discussion in the docs you've mentioned :)
> > >
> > > Cheers,
> > > Gabor
> > >
> > >
> > > On 2017-02-27 18:39, Philipp Zehnder wrote:
> > >
> > > Hello all,
> > >>
> > >> I’m new to this mailing list and I wanted to introduce myself. My name
> > is
> > >> Philipp Zehnder and I’m a Masters Student in Computer Science at the
> > >> Karlsruhe Institute of Technology in Germany currently writing on my
> > >> master’s thesis with the main goal to integrate reusable machine
> > learning
> > >> components into a stream processing network. One part of my thesis is
> to
> > >> create an API for distributed online machine learning.
> > >>
> > >> I saw that there are some recent discussions how to continue the
> > >> development of Flink ML [1] and I want to share some of my experiences
> > and
> > >> maybe get some feedback from the community for my ideas.
> > >>
> > >> As I am new to open source projects I hope this is the right place for
> > >> this.
> > >>
> > >> In the beginning, I had a look at different already existing
> frameworks
> > >> like Apache SAMOA for example, which is great and has a lot of useful
> > >> resources. However, as Flink is currently focusing on streaming, from
> my
> > >> point of view it makes sense to also have a streaming machine learning
> > API
> > >> as part of the Flink ecosystem.
> > >>
> > >> I’m currently working on building a prototype for a distributed
> > streaming
> > >> machine learning library based on Flink that can be used for online
> and
> > >> “classical” offline learning.
> > >>
> > >> The machine learning algorithm takes labeled and non-labeled data. On
> a
> > >> labeled data point first a prediction is performed and then this label
> > is
> > >> used to train the model. On a non-labeled data point just a prediction
> > is
> > >> performed. The main difference between the online and offline
> > algorithms is
> > >> that in the offline case the labeled data must be handed to the model
> > >> before the unlabeled data. In the online case, it is still possible to
> > >> process labeled data at a later point to update the model. The
> > advantage of
> > >> this approach is that batch algorithms can be applied on streaming
> data
> > as
> > >> well as online algorithms can be supported.
> > >>
> > >> One difference to batch learning are the transformers that are used to
> > >> preprocess the data. For example, a simple mean subtraction must be
> > >> implemented with a rolling mean, because we can’t calculate the mean
> > over
> > >> all the data, but the Flink Streaming API is perfect for that. It
> would
> > be
> > >> useful for users to have an extensible toolbox of transformers.
> > >>
> > >> Another difference is the evaluation of the models. As we don’t have a
> > >> single value to determine the model quality, in streaming scenarios
> this
> > >> value evolves over time when it sees more labeled data.
> > >>
> > >> However, the transformation and evaluation works again similar in both
> > >> online learning and offline learning.
> > >>
> > >> I also liked the discussion in [2] and I think that the competition in
> > >> the batch learning field is hard and there are already a lot of great
> > >> projects. I think it is true that in most real world problems it is
> not
> > >> necessary to update the model immediately, but there are a lot of use
> > cases
> > >> for machine learning on streams. For them it would be nice to have a
> > native
> > >> approach.
> > >>
> > >> A stream machine learning API for Flink would fit very well and I
> would
> > >> also be willing to contribute to the future development of the Flink
> ML
> > >> library.
> > >>
> > >>
> > >>
> > >> Best regards,
> > >>
> > >> Philipp
> > >>
> > >> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > >> com/DISCUSS-Flink-ML-roadmap-td16040.html <
> > http://apache-flink-mailing-l
> > >> ist-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-<http://ist-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap->
> td16040.html
> > >
> > >> [2] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
> > >> 49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2 <
> > >> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQ
> > >> c49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2>
> > >>
> > >>
> > >> Am 23.02.2017 um 15:48 schrieb Gábor Hermann <ma...@gaborhermann.com>>:
> > >>>
> > >>> Okay, I've created a skeleton of the design doc for choosing a
> > direction:
> > >>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
> > >>> 49h3Ud06MIRhahtJ6dw/edit?usp=sharing
> > >>>
> > >>> Much of the pros/cons have already been discussed here, so I'll try
> to
> > >>> put there all the arguments mentioned in this thread. Feel free to
> put
> > >>> there more :)
> > >>>
> > >>> @Stavros: I agree we should take action fast. What about collecting
> our
> > >>> thoughts in the doc by around Tuesday next week (28. February)? Then
> > decide
> > >>> on the direction and design a roadmap by around Friday (3. March)? Is
> > that
> > >>> feasible, or should it take more time?
> > >>>
> > >>> I think it will be necessary to have a shepherd, or even better a
> > >>> committer, to be involved in at least reviewing and accepting the
> > roadmap.
> > >>> It would be best, if a committer coordinated all this.
> > >>> @Theodore: Would you like to do the coordination?
> > >>>
> > >>> Regarding the use-cases: I've seen some abstracts of talks at SF
> Flink
> > >>> Forward [1] that seem promising. There are companies already using
> > Flink
> > >>> for ML [2,3,4,5].
> > >>>
> > >>> [1] http://sf.flink-forward.org/program/sessions/
> > >>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-str
> > >>> eaming-vs-micro-batch-for-online-learning/
> > >>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-te
> > >>> nsorflow/
> > >>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-le
> > >>> arning-on-flink/
> > >>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learn
> > >>> ing-scenarios-with-flink/
> > >>>
> > >>> Cheers,
> > >>> Gabor
> > >>>
> > >>>
> > >>> On 2017-02-23 15:19, Katherin Eri wrote:
> > >>>
> > >>>> I have asked already some teams for useful cases, but all of them
> need
> > >>>> time
> > >>>> to think.
> > >>>> During analysis something will finally arise.
> > >>>> May be we can ask partners of Flink  for cases? Data Artisans got
> > >>>> results
> > >>>> of customers survey: [1], ML better support is wanted, so we could
> ask
> > >>>> what
> > >>>> exactly is necessary.
> > >>>>
> > >>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
> > >>>>
> > >>>> 23 февр. 2017 г. 4:32 PM пользователь "Stavros Kontopoulos" <
> > >>>> st.kontopoulos@gmail.com<ma...@gmail.com>> написал:
> > >>>>
> > >>>> +100 for a design doc.
> > >>>>>
> > >>>>> Could we also set a roadmap after some time-boxed investigation
> > >>>>> captured in
> > >>>>> that document? We need action.
> > >>>>>
> > >>>>> Looking forward to work on this (whatever that might be) ;) Also
> are
> > >>>>> there
> > >>>>> any data supporting one direction or the other from a customer
> > >>>>> perspective?
> > >>>>> It would help to make more informed decisions.
> > >>>>>
> > >>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <
> > katherinmail@gmail.com<ma...@gmail.com>>
> > >>>>> wrote:
> > >>>>>
> > >>>>> Yes, ok.
> > >>>>>> let's start some design document, and write down there already
> > >>>>>> mentioned
> > >>>>>> ideas about: parameter server, about clipper and others. Would be
> > >>>>>> nice if
> > >>>>>> we will also map this approaches to cases.
> > >>>>>> Will work on it collaboratively on each topic, may be finally we
> > will
> > >>>>>>
> > >>>>> form
> > >>>>>
> > >>>>>> some picture, that could be agreed with committers.
> > >>>>>> @Gabor, could you please start such shared doc, as you have
> already
> > >>>>>>
> > >>>>> several
> > >>>>>
> > >>>>>> ideas proposed?
> > >>>>>>
> > >>>>>> чт, 23 февр. 2017, 15:06 Gábor Hermann <ma...@gaborhermann.com>>:
> > >>>>>>
> > >>>>>> I agree, that it's better to go in one direction first, but I
> think
> > >>>>>>> online and offline with streaming API can go somewhat parallel
> > later.
> > >>>>>>>
> > >>>>>> We
> > >>>>>
> > >>>>>> could set a short-term goal, concentrate initially on one
> direction,
> > >>>>>>>
> > >>>>>> and
> > >>>>>
> > >>>>>> showcase that direction (e.g. in a blogpost). But first, we should
> > >>>>>>> list
> > >>>>>>> the pros/cons in a design doc as a minimum. Then make a decision
> > what
> > >>>>>>> direction to go. Would that be feasible?
> > >>>>>>>
> > >>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
> > >>>>>>>
> > >>>>>>> I'm not sure that this is feasible, doing all at the same time
> > could
> > >>>>>>>>
> > >>>>>>> mean
> > >>>>>>
> > >>>>>>> doing nothing((((
> > >>>>>>>> I'm just afraid, that words: we will work on streaming not on
> > >>>>>>>>
> > >>>>>>> batching,
> > >>>>>
> > >>>>>> we
> > >>>>>>>
> > >>>>>>>> have no commiter's time for this, mean that yes, we started work
> > on
> > >>>>>>>> FLINK-1730, but nobody will commit this work in the end, as it
> > >>>>>>>>
> > >>>>>>> already
> > >>>>>
> > >>>>>> was
> > >>>>>>>
> > >>>>>>>> with this ticket.
> > >>>>>>>>
> > >>>>>>>> 23 февр. 2017 г. 14:26 пользователь "Gábor Hermann" <
> > >>>>>>>>
> > >>>>>>> mail@gaborhermann.com<ma...@gaborhermann.com>>
> > >>>>>>>
> > >>>>>>>> написал:
> > >>>>>>>>
> > >>>>>>>> @Theodore: Great to hear you think the "batch on streaming"
> > approach
> > >>>>>>>>>
> > >>>>>>>> is
> > >>>>>>
> > >>>>>>> possible! Of course, we need to pay attention all the pitfalls
> > >>>>>>>>>
> > >>>>>>>> there,
> > >>>>>
> > >>>>>> if we
> > >>>>>>>
> > >>>>>>>> go that way.
> > >>>>>>>>>
> > >>>>>>>>> +1 for a design doc!
> > >>>>>>>>>
> > >>>>>>>>> I would add that it's possible to make efforts in all the three
> > >>>>>>>>>
> > >>>>>>>> directions
> > >>>>>>>
> > >>>>>>>> (i.e. batch, online, batch on streaming) at the same time.
> > Although,
> > >>>>>>>>>
> > >>>>>>>> it
> > >>>>>>
> > >>>>>>> might be worth to concentrate on one. E.g. it would not be so
> > useful
> > >>>>>>>>>
> > >>>>>>>> to
> > >>>>>>
> > >>>>>>> have the same batch algorithms with both the batch API and
> > streaming
> > >>>>>>>>>
> > >>>>>>>> API.
> > >>>>>>>
> > >>>>>>>> We can decide later.
> > >>>>>>>>>
> > >>>>>>>>> The design doc could be partitioned to these 3 directions, and
> we
> > >>>>>>>>>
> > >>>>>>>> can
> > >>>>>
> > >>>>>> collect there the pros/cons too. What do you think?
> > >>>>>>>>>
> > >>>>>>>>> Cheers,
> > >>>>>>>>> Gabor
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
> > >>>>>>>>>
> > >>>>>>>>> Hello all,
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> @Gabor, we have discussed the idea of using the streaming API
> to
> > >>>>>>>>>>
> > >>>>>>>>> write
> > >>>>>>
> > >>>>>>> all
> > >>>>>>>
> > >>>>>>>> of our ML algorithms with a couple of people offline,
> > >>>>>>>>>> and I think it might be possible and is generally worth a
> shot.
> > >>>>>>>>>> The
> > >>>>>>>>>> approach we would take would be close to Vowpal Wabbit, not
> > >>>>>>>>>> exactly
> > >>>>>>>>>> "online", but rather "fast-batch".
> > >>>>>>>>>>
> > >>>>>>>>>> There will be problems popping up again, even for very simple
> > >>>>>>>>>> algos
> > >>>>>>>>>>
> > >>>>>>>>> like
> > >>>>>>>
> > >>>>>>>> on
> > >>>>>>>>>> line linear regression with SGD [1], but hopefully fixing
> those
> > >>>>>>>>>>
> > >>>>>>>>> will
> > >>>>>
> > >>>>>> be
> > >>>>>>
> > >>>>>>> more aligned with the priorities of the community.
> > >>>>>>>>>>
> > >>>>>>>>>> @Katherin, my understanding is that given the limited
> resources,
> > >>>>>>>>>>
> > >>>>>>>>> there
> > >>>>>>
> > >>>>>>> is
> > >>>>>>>
> > >>>>>>>> no development effort focused on batch processing right now.
> > >>>>>>>>>>
> > >>>>>>>>>> So to summarize, it seems like there are people willing to
> work
> > on
> > >>>>>>>>>>
> > >>>>>>>>> ML
> > >>>>>
> > >>>>>> on
> > >>>>>>>
> > >>>>>>>> Flink, but nobody is sure how to do it.
> > >>>>>>>>>> There are many directions we could take (batch, online, batch
> on
> > >>>>>>>>>> streaming), each with its own merits and downsides.
> > >>>>>>>>>>
> > >>>>>>>>>> If you want we can start a design doc and move the
> conversation
> > >>>>>>>>>>
> > >>>>>>>>> there,
> > >>>>>>
> > >>>>>>> come
> > >>>>>>>>>> up with a roadmap and start implementing.
> > >>>>>>>>>>
> > >>>>>>>>>> Regards,
> > >>>>>>>>>> Theodore
> > >>>>>>>>>>
> > >>>>>>>>>> [1]
> > >>>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.
> > >>>>>>>>>> nabble.com/Understanding-connected-streams-use-without-times<http://nabble.com/Understanding-connected-streams-use-without-times>
> > >>>>>>>>>> tamps-td10241.html
> > >>>>>>>>>>
> > >>>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
> > >>>>>>>>>>
> > >>>>>>>>> mail@gaborhermann.com<ma...@gaborhermann.com>
> > >>>>>>
> > >>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> It's great to see so much activity in this discussion :)
> > >>>>>>>>>>
> > >>>>>>>>>>> I'll try to add my thoughts.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I think building a developer community (Till's 2. point) can
> be
> > >>>>>>>>>>>
> > >>>>>>>>>> slightly
> > >>>>>>>
> > >>>>>>>> separated from what features we should aim for (1. point) and
> > >>>>>>>>>>>
> > >>>>>>>>>> showcasing
> > >>>>>>>
> > >>>>>>>> (3. point). Thanks Till for bringing up the ideas for
> > >>>>>>>>>>>
> > >>>>>>>>>> restructuring,
> > >>>>>
> > >>>>>> I'm
> > >>>>>>>
> > >>>>>>>> sure we'll find a way to make the development process more
> > >>>>>>>>>>>
> > >>>>>>>>>> dynamic.
> > >>>>>
> > >>>>>> I'll
> > >>>>>>>
> > >>>>>>>> try to address the rest here.
> > >>>>>>>>>>>
> > >>>>>>>>>>> It's hard to choose directions between streaming and batch
> ML.
> > As
> > >>>>>>>>>>>
> > >>>>>>>>>> Theo
> > >>>>>>
> > >>>>>>> has
> > >>>>>>>>>>> indicated, not much online ML is used in production, but
> Flink
> > >>>>>>>>>>> concentrates
> > >>>>>>>>>>> on streaming, so online ML would be a better fit for Flink.
> > >>>>>>>>>>>
> > >>>>>>>>>> However,
> > >>>>>
> > >>>>>> as
> > >>>>>>>
> > >>>>>>>> most of you argued, there's definite need for batch ML. But
> batch
> > >>>>>>>>>>>
> > >>>>>>>>>> ML
> > >>>>>
> > >>>>>> seems
> > >>>>>>>>>>> hard to achieve because there are blocking issues with
> > >>>>>>>>>>> persisting,
> > >>>>>>>>>>> iteration paths etc. So it's no good either way.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I propose a seemingly crazy solution: what if we developed
> > batch
> > >>>>>>>>>>> algorithms also with the streaming API? The batch API would
> > >>>>>>>>>>>
> > >>>>>>>>>> clearly
> > >>>>>
> > >>>>>> seem
> > >>>>>>>
> > >>>>>>>> more suitable for ML algorithms, but there a lot of benefits of
> > >>>>>>>>>>>
> > >>>>>>>>>> this
> > >>>>>
> > >>>>>> approach too, so it's clearly worth considering. Flink also has
> > >>>>>>>>>>>
> > >>>>>>>>>> the
> > >>>>>
> > >>>>>> high
> > >>>>>>>
> > >>>>>>>> level vision of "streaming for everything" that would clearly
> fit
> > >>>>>>>>>>>
> > >>>>>>>>>> this
> > >>>>>>
> > >>>>>>> case. What do you all think about this? Do you think this
> solution
> > >>>>>>>>>>>
> > >>>>>>>>>> would
> > >>>>>>>
> > >>>>>>>> be
> > >>>>>>>>>>> feasible? I would be happy to make a more elaborate proposal,
> > but
> > >>>>>>>>>>>
> > >>>>>>>>>> I
> > >>>>>
> > >>>>>> push
> > >>>>>>>
> > >>>>>>>> my
> > >>>>>>>>>>> main ideas here:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1) Simplifying by using one system
> > >>>>>>>>>>> It could simplify the work of both the users and the
> > developers.
> > >>>>>>>>>>>
> > >>>>>>>>>> One
> > >>>>>
> > >>>>>> could
> > >>>>>>>>>>> execute training once, or could execute it periodically e.g.
> by
> > >>>>>>>>>>>
> > >>>>>>>>>> using
> > >>>>>>
> > >>>>>>> windows. Low-latency serving and training could be done in the
> > >>>>>>>>>>>
> > >>>>>>>>>> same
> > >>>>>
> > >>>>>> system.
> > >>>>>>>>>>> We could implement incremental algorithms, without any side
> > >>>>>>>>>>> inputs
> > >>>>>>>>>>>
> > >>>>>>>>>> for
> > >>>>>>
> > >>>>>>> combining online learning (or predictions) with batch learning.
> Of
> > >>>>>>>>>>> course,
> > >>>>>>>>>>> all the logic describing these must be somehow implemented
> > (e.g.
> > >>>>>>>>>>> synchronizing predictions with training), but it should be
> > easier
> > >>>>>>>>>>>
> > >>>>>>>>>> to
> > >>>>>
> > >>>>>> do
> > >>>>>>>
> > >>>>>>>> so
> > >>>>>>>>>>> in one system, than by combining e.g. the batch and streaming
> > >>>>>>>>>>> API.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2) Batch ML with the streaming API is not harder
> > >>>>>>>>>>> Despite these benefits, it could seem harder to implement
> batch
> > >>>>>>>>>>> ML
> > >>>>>>>>>>>
> > >>>>>>>>>> with
> > >>>>>>>
> > >>>>>>>> the streaming API, but in my opinion it's not. There are more
> > >>>>>>>>>>>
> > >>>>>>>>>> flexible,
> > >>>>>>>
> > >>>>>>>> lower-level optimization potentials with the streaming API. Most
> > >>>>>>>>>>> distributed ML algorithms use a lower-level model than the
> > batch
> > >>>>>>>>>>>
> > >>>>>>>>>> API
> > >>>>>
> > >>>>>> anyway, so sometimes it feels like forcing the algorithm logic
> > >>>>>>>>>>>
> > >>>>>>>>>> into
> > >>>>>
> > >>>>>> the
> > >>>>>>>
> > >>>>>>>> training API and tweaking it. Although we could not use the
> batch
> > >>>>>>>>>>> primitives like join, we would have the E.g. in my experience
> > >>>>>>>>>>> with
> > >>>>>>>>>>> implementing a distributed matrix factorization algorithm
> [1],
> > I
> > >>>>>>>>>>>
> > >>>>>>>>>> couldn't
> > >>>>>>>
> > >>>>>>>> do a simple optimization because of the limitations of the
> > >>>>>>>>>>>
> > >>>>>>>>>> iteration
> > >>>>>
> > >>>>>> API
> > >>>>>>>
> > >>>>>>>> [2]. Even if we pushed all the development effort to make the
> > >>>>>>>>>>>
> > >>>>>>>>>> batch
> > >>>>>
> > >>>>>> API
> > >>>>>>>
> > >>>>>>>> more suitable for ML there would be things we couldn't do. E.g.
> > >>>>>>>>>>>
> > >>>>>>>>>> there
> > >>>>>>
> > >>>>>>> are
> > >>>>>>>
> > >>>>>>>> approaches for updating a model iteratively without locks [3,4]
> > >>>>>>>>>>>
> > >>>>>>>>>> (i.e.
> > >>>>>>
> > >>>>>>> somewhat asynchronously), and I don't see a clear way to
> implement
> > >>>>>>>>>>>
> > >>>>>>>>>> such
> > >>>>>>>
> > >>>>>>>> algorithms with the batch API.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 3) Streaming community (users and devs) benefit
> > >>>>>>>>>>> The Flink streaming community in general would also benefit
> > from
> > >>>>>>>>>>>
> > >>>>>>>>>> this
> > >>>>>>
> > >>>>>>> direction. There are many features needed in the streaming API
> for
> > >>>>>>>>>>>
> > >>>>>>>>>> ML
> > >>>>>>
> > >>>>>>> to
> > >>>>>>>
> > >>>>>>>> work, but this is also true for the batch API. One really
> > >>>>>>>>>>>
> > >>>>>>>>>> important
> > >>>>>
> > >>>>>> is
> > >>>>>>
> > >>>>>>> the
> > >>>>>>>>>>> loops API (a.k.a. iterative DataStreams) [5]. There has been
> a
> > >>>>>>>>>>> lot
> > >>>>>>>>>>>
> > >>>>>>>>>> of
> > >>>>>>
> > >>>>>>> effort (mostly from Paris) for making it mature enough [6]. Kate
> > >>>>>>>>>>> mentioned
> > >>>>>>>>>>> using GPUs, and I'm sure they have uses in streaming
> generally
> > >>>>>>>>>>>
> > >>>>>>>>>> [7].
> > >>>>>
> > >>>>>> Thus,
> > >>>>>>>
> > >>>>>>>> by improving the streaming API to allow ML algorithms, the
> > >>>>>>>>>>>
> > >>>>>>>>>> streaming
> > >>>>>
> > >>>>>> API
> > >>>>>>>
> > >>>>>>>> benefit too (which is important as they have a lot more
> production
> > >>>>>>>>>>>
> > >>>>>>>>>> users
> > >>>>>>>
> > >>>>>>>> than the batch API).
> > >>>>>>>>>>>
> > >>>>>>>>>>> 4) Performance can be at least as good
> > >>>>>>>>>>> I believe the same performance could be achieved with the
> > >>>>>>>>>>>
> > >>>>>>>>>> streaming
> > >>>>>
> > >>>>>> API
> > >>>>>>>
> > >>>>>>>> as
> > >>>>>>>>>>> with the batch API. Streaming API is much closer to the
> runtime
> > >>>>>>>>>>>
> > >>>>>>>>>> than
> > >>>>>
> > >>>>>> the
> > >>>>>>>
> > >>>>>>>> batch API. For corner-cases, with runtime-layer optimizations of
> > >>>>>>>>>>>
> > >>>>>>>>>> batch
> > >>>>>>
> > >>>>>>> API,
> > >>>>>>>>>>> we could find a way to do the same (or similar) optimization
> > for
> > >>>>>>>>>>>
> > >>>>>>>>>> the
> > >>>>>
> > >>>>>> streaming API (see my previous point). Such case could be using
> > >>>>>>>>>>>
> > >>>>>>>>>> managed
> > >>>>>>>
> > >>>>>>>> memory (and spilling to disk). There are also benefits by
> default,
> > >>>>>>>>>>>
> > >>>>>>>>>> e.g.
> > >>>>>>>
> > >>>>>>>> we
> > >>>>>>>>>>> would have a finer grained fault tolerance with the streaming
> > >>>>>>>>>>> API.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 5) We could keep batch ML API
> > >>>>>>>>>>> For the shorter term, we should not throw away all the
> > algorithms
> > >>>>>>>>>>> implemented with the batch API. By pushing forward the
> > >>>>>>>>>>> development
> > >>>>>>>>>>>
> > >>>>>>>>>> with
> > >>>>>>>
> > >>>>>>>> side inputs we could make them usable with streaming API. Then,
> if
> > >>>>>>>>>>>
> > >>>>>>>>>> the
> > >>>>>>
> > >>>>>>> library gains some popularity, we could replace the algorithms in
> > >>>>>>>>>>>
> > >>>>>>>>>> the
> > >>>>>>
> > >>>>>>> batch
> > >>>>>>>>>>> API with streaming ones, to avoid the performance costs of
> e.g.
> > >>>>>>>>>>>
> > >>>>>>>>>> not
> > >>>>>
> > >>>>>> being
> > >>>>>>>
> > >>>>>>>> able to persist.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 6) General tools for implementing ML algorithms
> > >>>>>>>>>>> Besides implementing algorithms one by one, we could give
> more
> > >>>>>>>>>>>
> > >>>>>>>>>> general
> > >>>>>>
> > >>>>>>> tools for making it easier to implement algorithms. E.g.
> parameter
> > >>>>>>>>>>>
> > >>>>>>>>>> server
> > >>>>>>>
> > >>>>>>>> [8,9]. Theo also mentioned in another thread that TensorFlow
> has a
> > >>>>>>>>>>> similar
> > >>>>>>>>>>> model to Flink streaming, we could look into that too. I
> think
> > >>>>>>>>>>>
> > >>>>>>>>>> often
> > >>>>>
> > >>>>>> when
> > >>>>>>>
> > >>>>>>>> deploying a production ML system, much more configuration and
> > >>>>>>>>>>>
> > >>>>>>>>>> tweaking
> > >>>>>>
> > >>>>>>> should be done than e.g. Spark MLlib allows. Why not allow that?
> > >>>>>>>>>>>
> > >>>>>>>>>>> 7) Showcasing
> > >>>>>>>>>>> Showcasing this could be easier. We could say that we're
> doing
> > >>>>>>>>>>>
> > >>>>>>>>>> batch
> > >>>>>
> > >>>>>> ML
> > >>>>>>>
> > >>>>>>>> with a streaming API. That's interesting in its own. IMHO this
> > >>>>>>>>>>> integration
> > >>>>>>>>>>> is also a more approachable way towards end-to-end ML.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for reading so far :)
> > >>>>>>>>>>>
> > >>>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
> > >>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
> > >>>>>>>>>>> [3] https://people.eecs.berkeley.
> edu/~brecht/papers/hogwildTR.
> > pd
> > >>>>>>>>>>> f
> > >>>>>>>>>>> [4] https://www.usenix.org/system/
> > files/conference/hotos13/hotos
> > >>>>>>>>>>> 13-final77.pdf
> > >>>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-
> 15+
> > >>>>>>>>>>> Scoped+Loops+and+Job+Termination
> > >>>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
> > >>>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-
> > sigmod16.
> > >>>>>>>>>>>
> > >>>>>>>>>> pdf
> > >>>>>
> > >>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
> > >>>>>>>>>>> [9] http://apache-flink-mailing-
> list-archive.1008284.n3.nabble
> > .
> > >>>>>>>>>>> com/Using-QueryableState-inside-Flink-jobs-and-
> > >>>>>>>>>>> Parameter-Server-implementation-td15880.html
> > >>>>>>>>>>>
> > >>>>>>>>>>> Cheers,
> > >>>>>>>>>>> Gabor
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>
> > >>>>>> *Yours faithfully, *
> > >>>>>>
> > >>>>>> *Kate Eri.*
> > >>>>>>
> > >>>>>>
> > >>
> > >
> >
> >
> > --
> > Roberto Bentivoglio
> > CTO
> > e. roberto.bentivoglio@radicalbit.io
> > Radicalbit S.r.l.
> > radicalbit.io
> >
>
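As a rough illustration of the pattern Roberto outlines above (train a model in an external system, export it as PMML, and evaluate it inside a Flink DataStream job), here is a minimal Scala sketch. The PmmlScorer trait and its fromFile helper are placeholders standing in for a real PMML evaluator such as jpmml-evaluator [1]; they are not the flink-jpmml API. Only the DataStream calls come from the actual Flink Streaming Scala API.

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

// Placeholder for a PMML evaluation library; not a real API.
trait PmmlScorer extends Serializable {
  def score(features: Array[Double]): Double
}

object PmmlScorer {
  // Stand-in: a real implementation would parse the exported PMML file
  // (e.g. with jpmml-evaluator) and delegate to its model evaluator.
  def fromFile(path: String): PmmlScorer = new PmmlScorer {
    override def score(features: Array[Double]): Double = features.sum // dummy "model"
  }
}

object PmmlScoringJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // In practice this would be a Kafka or socket source of feature vectors.
    val features: DataStream[Array[Double]] = env.fromElements(
      Array(5.1, 3.5, 1.4, 0.2),
      Array(6.7, 3.0, 5.2, 2.3))

    val predictions = features.map(new RichMapFunction[Array[Double], Double] {
      @transient private var scorer: PmmlScorer = _

      // The exported model is loaded once per parallel task, not once per record.
      override def open(parameters: Configuration): Unit = {
        scorer = PmmlScorer.fromFile("/path/to/exported-model.pmml")
      }

      override def map(vector: Array[Double]): Double = scorer.score(vector)
    })

    predictions.print()
    env.execute("PMML scoring on a DataStream (sketch)")
  }
}

The point of the split is that training stays outside Flink while serving is an ordinary DataStream operator, so it benefits from Flink's normal scaling and fault tolerance; swapping the dummy scorer for a real PMML evaluator would not change the job structure.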


Re: [DISCUSS] Flink ML roadmap

Posted by Theodore Vasiloudis <th...@gmail.com>.
It seems like a relatively new project, backed by Intel.

My impression from the doc Roberto linked is that they might switch to
using Beam instead of Spark (?)

I'm cc'ing Soila, who is a developer of TAP and has worked on FlinkML in
the past; perhaps she has some input on how they plan to work with
streaming and ML in TAP.

Repos:
[1] https://github.com/tapanalyticstoolkit/

On Fri, Mar 3, 2017 at 12:24 PM, Stavros Kontopoulos <
st.kontopoulos@gmail.com> wrote:

> Interesting  thanx @Roberto.  I see that only TAP Analytics Toolkit
> supports streaming. I am not aware of its market share, anyone?
>
> Best,
> Stavros
>
> On Fri, Mar 3, 2017 at 11:50 AM, Theodore Vasiloudis <
> theodoros.vasiloudis@gmail.com> wrote:
>
> > Thank you for the links Roberto I did not know that Beam was working on
> an
> > ML abstraction as well. I'm sure we can learn from that.
> >
> > I'll start another thread today where we can discuss next steps and
> action
> > points now that we have a few different paths to follow listed on the
> > shared doc,
> > since our deadline was today. We welcome further discussions of course.
> >
> > Regards,
> > Theodore
> >
> > On Thu, Mar 2, 2017 at 10:52 AM, Roberto Bentivoglio <
> > roberto.bentivoglio@radicalbit.io> wrote:
> >
> > > Hi All,
> > >
> > > First of all I'd like to introduce myself: my name is Roberto
> Bentivoglio
> > > and I'm currently working for Radicalbit as Andrea Spina (he already
> > wrote
> > > on this thread).
> > > I didn't have the chance to directly contribute on Flink up to now, but
> > > some colleagues of mine are doing that since at least one year (they
> > > contributed also on the machine learning library).
> > >
> > > I hope I'm not jumping into discussione too late, it's really
> interesting
> > > and the analysis document is depicting really well the scenarios
> > currently
> > > available. Many thanks for your effort!
> > >
> > > If I can add my two cents to the discussion I'd like to add the
> > following:
> > >  - it's clear that currently the Flink community is deeply focused on
> > > streaming features than batch features. For this reason I think that
> > > implement "Offline learning with Streaming API" is really a great idea.
> > >  - I think that the "Online learning" option is really a good fit for
> > > Flink, but maybe we could give at the beginning an higher priority to
> the
> > > "Offline learning with Streaming API" option. However I think that this
> > > option will be the main goal for the mid/long term.
> > >  - we implemented a library based on jpmml-evaluator[1] and flink
> called
> > > "flink-jpmml". Using this library you can train the models on external
> > > systems and use those models, after you've exported in a PMML standard
> > > format, to run evaluations on top of DataStream API. We don't have open
> > > sourced this library up to now, but we're planning to do this in the
> next
> > > weeks. We'd like to complete the documentation and the final code
> reviews
> > > before to share it. I hope it will be helpful for the community to
> > enhance
> > > the ML support on Flink
> > >  - I'd like also to mention that the Apache Beam community is thiking
> on
> > a
> > > ML DSL. There is a design document and a couple of Jira tasks for that
> > > [2][3]
> > >
> > > We're really keen to focus our effort to improve the ML support on
> Flink
> > in
> > > Radicalbit, we will contribute on this effort for sure on a regular
> basis
> > > with our team.
> > >
> > > Looking forward to work with you!
> > >
> > > Many thanks,
> > > Roberto
> > >
> > > [1] - https://github.com/jpmml/jpmml-evaluator
> > > [2] -
> > > https://docs.google.com/document/d/17cRZk_
> yqHm3C0fljivjN66MbLkeKS1yjo4PB
> > > ECHb-xA
> > > [3] - https://issues.apache.org/jira/browse/BEAM-303
> > >
> > > On 28 February 2017 at 19:35, Gábor Hermann <ma...@gaborhermann.com>
> > wrote:
> > >
> > > > Hi Philipp,
> > > >
> > > > It's great to hear you are interested in Flink ML!
> > > >
> > > > Based on your description, your prototype seems like an interesting
> > > > approach for combining online+offline learning. If you're interested,
> > we
> > > > might find a way to integrate your work, or at least your ideas, into
> > > Flink
> > > > ML if we decide on a direction that fits your approach. I think your
> > work
> > > > could be relevant for almost all the directions listed there (if I
> > > > understand correctly you'd even like to serve predictions on
> unlabeled
> > > > data).
> > > >
> > > > Feel free to join the discussion in the docs you've mentioned :)
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > >
> > > > On 2017-02-27 18:39, Philipp Zehnder wrote:
> > > >
> > > > Hello all,
> > > >>
> > > >> I’m new to this mailing list and I wanted to introduce myself. My
> name
> > > is
> > > >> Philipp Zehnder and I’m a Masters Student in Computer Science at the
> > > >> Karlsruhe Institute of Technology in Germany currently writing on my
> > > >> master’s thesis with the main goal to integrate reusable machine
> > > learning
> > > >> components into a stream processing network. One part of my thesis
> is
> > to
> > > >> create an API for distributed online machine learning.
> > > >>
> > > >> I saw that there are some recent discussions how to continue the
> > > >> development of Flink ML [1] and I want to share some of my
> experiences
> > > and
> > > >> maybe get some feedback from the community for my ideas.
> > > >>
> > > >> As I am new to open source projects I hope this is the right place
> for
> > > >> this.
> > > >>
> > > >> In the beginning, I had a look at different already existing
> > frameworks
> > > >> like Apache SAMOA for example, which is great and has a lot of
> useful
> > > >> resources. However, as Flink is currently focusing on streaming,
> from
> > my
> > > >> point of view it makes sense to also have a streaming machine
> learning
> > > API
> > > >> as part of the Flink ecosystem.
> > > >>
> > > >> I’m currently working on building a prototype for a distributed
> > > streaming
> > > >> machine learning library based on Flink that can be used for online
> > and
> > > >> “classical” offline learning.
> > > >>
> > > >> The machine learning algorithm takes labeled and non-labeled data.
> On
> > a
> > > >> labeled data point first a prediction is performed and then this
> label
> > > is
> > > >> used to train the model. On a non-labeled data point just a
> prediction
> > > is
> > > >> performed. The main difference between the online and offline
> > > algorithms is
> > > >> that in the offline case the labeled data must be handed to the
> model
> > > >> before the unlabeled data. In the online case, it is still possible
> to
> > > >> process labeled data at a later point to update the model. The
> > > advantage of
> > > >> this approach is that batch algorithms can be applied on streaming
> > data
> > > as
> > > >> well as online algorithms can be supported.
> > > >>
> > > >> One difference to batch learning are the transformers that are used
> to
> > > >> preprocess the data. For example, a simple mean subtraction must be
> > > >> implemented with a rolling mean, because we can’t calculate the mean
> > > over
> > > >> all the data, but the Flink Streaming API is perfect for that. It
> > would
> > > be
> > > >> useful for users to have an extensible toolbox of transformers.
> > > >>
> > > >> Another difference is the evaluation of the models. As we don’t
> have a
> > > >> single value to determine the model quality, in streaming scenarios
> > this
> > > >> value evolves over time when it sees more labeled data.
> > > >>
> > > >> However, the transformation and evaluation works again similar in
> both
> > > >> online learning and offline learning.
> > > >>
> > > >> I also liked the discussion in [2] and I think that the competition
> in
> > > >> the batch learning field is hard and there are already a lot of
> great
> > > >> projects. I think it is true that in most real world problems it is
> > not
> > > >> necessary to update the model immediately, but there are a lot of
> use
> > > cases
> > > >> for machine learning on streams. For them it would be nice to have a
> > > native
> > > >> approach.
> > > >>
> > > >> A stream machine learning API for Flink would fit very well and I
> > would
> > > >> also be willing to contribute to the future development of the Flink
> > ML
> > > >> library.
> > > >>
> > > >>
> > > >>
> > > >> Best regards,
> > > >>
> > > >> Philipp
> > > >>
> > > >> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > > >> com/DISCUSS-Flink-ML-roadmap-td16040.html <
> > > http://apache-flink-mailing-l
> > > >> ist-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-
> > td16040.html
> > > >
> > > >> [2] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
> > > >> 49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2 <
> > > >> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQ
> > > >> c49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2>
> > > >>
> > > >>
> > > >> Am 23.02.2017 um 15:48 schrieb Gábor Hermann <mail@gaborhermann.com
> >:
> > > >>>
> > > >>> Okay, I've created a skeleton of the design doc for choosing a
> > > direction:
> > > >>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
> > > >>> 49h3Ud06MIRhahtJ6dw/edit?usp=sharing
> > > >>>
> > > >>> Much of the pros/cons have already been discussed here, so I'll try
> > to
> > > >>> put there all the arguments mentioned in this thread. Feel free to
> > put
> > > >>> there more :)
> > > >>>
> > > >>> @Stavros: I agree we should take action fast. What about collecting
> > our
> > > >>> thoughts in the doc by around Tuesday next week (28. February)?
> Then
> > > decide
> > > >>> on the direction and design a roadmap by around Friday (3. March)?
> Is
> > > that
> > > >>> feasible, or should it take more time?
> > > >>>
> > > >>> I think it will be necessary to have a shepherd, or even better a
> > > >>> committer, to be involved in at least reviewing and accepting the
> > > roadmap.
> > > >>> It would be best, if a committer coordinated all this.
> > > >>> @Theodore: Would you like to do the coordination?
> > > >>>
> > > >>> Regarding the use-cases: I've seen some abstracts of talks at SF
> > Flink
> > > >>> Forward [1] that seem promising. There are companies already using
> > > Flink
> > > >>> for ML [2,3,4,5].
> > > >>>
> > > >>> [1] http://sf.flink-forward.org/program/sessions/
> > > >>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-str
> > > >>> eaming-vs-micro-batch-for-online-learning/
> > > >>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-te
> > > >>> nsorflow/
> > > >>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-le
> > > >>> arning-on-flink/
> > > >>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learn
> > > >>> ing-scenarios-with-flink/
> > > >>>
> > > >>> Cheers,
> > > >>> Gabor
> > > >>>
> > > >>>
> > > >>> On 2017-02-23 15:19, Katherin Eri wrote:
> > > >>>
> > > >>>> I have asked already some teams for useful cases, but all of them
> > need
> > > >>>> time
> > > >>>> to think.
> > > >>>> During analysis something will finally arise.
> > > >>>> May be we can ask partners of Flink  for cases? Data Artisans got
> > > >>>> results
> > > >>>> of customers survey: [1], ML better support is wanted, so we could
> > ask
> > > >>>> what
> > > >>>> exactly is necessary.
> > > >>>>
> > > >>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
> > > >>>>
> > > >>>> 23 февр. 2017 г. 4:32 PM пользователь "Stavros Kontopoulos" <
> > > >>>> st.kontopoulos@gmail.com> написал:
> > > >>>>
> > > >>>> +100 for a design doc.
> > > >>>>>
> > > >>>>> Could we also set a roadmap after some time-boxed investigation
> > > >>>>> captured in
> > > >>>>> that document? We need action.
> > > >>>>>
> > > >>>>> Looking forward to work on this (whatever that might be) ;) Also
> > are
> > > >>>>> there
> > > >>>>> any data supporting one direction or the other from a customer
> > > >>>>> perspective?
> > > >>>>> It would help to make more informed decisions.
> > > >>>>>
> > > >>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <
> > > katherinmail@gmail.com>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>> Yes, ok.
> > > >>>>>> let's start some design document, and write down there already
> > > >>>>>> mentioned
> > > >>>>>> ideas about: parameter server, about clipper and others. Would
> be
> > > >>>>>> nice if
> > > >>>>>> we will also map this approaches to cases.
> > > >>>>>> Will work on it collaboratively on each topic, may be finally we
> > > will
> > > >>>>>>
> > > >>>>> form
> > > >>>>>
> > > >>>>>> some picture, that could be agreed with committers.
> > > >>>>>> @Gabor, could you please start such shared doc, as you have
> > already
> > > >>>>>>
> > > >>>>> several
> > > >>>>>
> > > >>>>>> ideas proposed?
> > > >>>>>>
> > > >>>>>> чт, 23 февр. 2017, 15:06 Gábor Hermann <ma...@gaborhermann.com>:
> > > >>>>>>
> > > >>>>>> I agree, that it's better to go in one direction first, but I
> > think
> > > >>>>>>> online and offline with streaming API can go somewhat parallel
> > > later.
> > > >>>>>>>
> > > >>>>>> We
> > > >>>>>
> > > >>>>>> could set a short-term goal, concentrate initially on one
> > direction,
> > > >>>>>>>
> > > >>>>>> and
> > > >>>>>
> > > >>>>>> showcase that direction (e.g. in a blogpost). But first, we
> should
> > > >>>>>>> list
> > > >>>>>>> the pros/cons in a design doc as a minimum. Then make a
> decision
> > > what
> > > >>>>>>> direction to go. Would that be feasible?
> > > >>>>>>>
> > > >>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
> > > >>>>>>>
> > > >>>>>>> I'm not sure that this is feasible, doing all at the same time
> > > could
> > > >>>>>>>>
> > > >>>>>>> mean
> > > >>>>>>
> > > >>>>>>> doing nothing((((
> > > >>>>>>>> I'm just afraid, that words: we will work on streaming not on
> > > >>>>>>>>
> > > >>>>>>> batching,
> > > >>>>>
> > > >>>>>> we
> > > >>>>>>>
> > > >>>>>>>> have no commiter's time for this, mean that yes, we started
> work
> > > on
> > > >>>>>>>> FLINK-1730, but nobody will commit this work in the end, as it
> > > >>>>>>>>
> > > >>>>>>> already
> > > >>>>>
> > > >>>>>> was
> > > >>>>>>>
> > > >>>>>>>> with this ticket.
> > > >>>>>>>>
> > > >>>>>>>> 23 февр. 2017 г. 14:26 пользователь "Gábor Hermann" <
> > > >>>>>>>>
> > > >>>>>>> mail@gaborhermann.com>
> > > >>>>>>>
> > > >>>>>>>> написал:
> > > >>>>>>>>
> > > >>>>>>>> @Theodore: Great to hear you think the "batch on streaming"
> > > approach
> > > >>>>>>>>>
> > > >>>>>>>> is
> > > >>>>>>
> > > >>>>>>> possible! Of course, we need to pay attention all the pitfalls
> > > >>>>>>>>>
> > > >>>>>>>> there,
> > > >>>>>
> > > >>>>>> if we
> > > >>>>>>>
> > > >>>>>>>> go that way.
> > > >>>>>>>>>
> > > >>>>>>>>> +1 for a design doc!
> > > >>>>>>>>>
> > > >>>>>>>>> I would add that it's possible to make efforts in all the
> three
> > > >>>>>>>>>
> > > >>>>>>>> directions
> > > >>>>>>>
> > > >>>>>>>> (i.e. batch, online, batch on streaming) at the same time.
> > > Although,
> > > >>>>>>>>>
> > > >>>>>>>> it
> > > >>>>>>
> > > >>>>>>> might be worth to concentrate on one. E.g. it would not be so
> > > useful
> > > >>>>>>>>>
> > > >>>>>>>> to
> > > >>>>>>
> > > >>>>>>> have the same batch algorithms with both the batch API and
> > > streaming
> > > >>>>>>>>>
> > > >>>>>>>> API.
> > > >>>>>>>
> > > >>>>>>>> We can decide later.
> > > >>>>>>>>>
> > > >>>>>>>>> The design doc could be partitioned to these 3 directions,
> and
> > we
> > > >>>>>>>>>
> > > >>>>>>>> can
> > > >>>>>
> > > >>>>>> collect there the pros/cons too. What do you think?
> > > >>>>>>>>>
> > > >>>>>>>>> Cheers,
> > > >>>>>>>>> Gabor
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Hello all,
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> @Gabor, we have discussed the idea of using the streaming
> API
> > to
> > > >>>>>>>>>>
> > > >>>>>>>>> write
> > > >>>>>>
> > > >>>>>>> all
> > > >>>>>>>
> > > >>>>>>>> of our ML algorithms with a couple of people offline,
> > > >>>>>>>>>> and I think it might be possible and is generally worth a
> > shot.
> > > >>>>>>>>>> The
> > > >>>>>>>>>> approach we would take would be close to Vowpal Wabbit, not
> > > >>>>>>>>>> exactly
> > > >>>>>>>>>> "online", but rather "fast-batch".
> > > >>>>>>>>>>
> > > >>>>>>>>>> There will be problems popping up again, even for very
> simple
> > > >>>>>>>>>> algos
> > > >>>>>>>>>>
> > > >>>>>>>>> like
> > > >>>>>>>
> > > >>>>>>>> on
> > > >>>>>>>>>> line linear regression with SGD [1], but hopefully fixing
> > those
> > > >>>>>>>>>>
> > > >>>>>>>>> will
> > > >>>>>
> > > >>>>>> be
> > > >>>>>>
> > > >>>>>>> more aligned with the priorities of the community.
> > > >>>>>>>>>>
> > > >>>>>>>>>> @Katherin, my understanding is that given the limited
> > resources,
> > > >>>>>>>>>>
> > > >>>>>>>>> there
> > > >>>>>>
> > > >>>>>>> is
> > > >>>>>>>
> > > >>>>>>>> no development effort focused on batch processing right now.
> > > >>>>>>>>>>
> > > >>>>>>>>>> So to summarize, it seems like there are people willing to
> > work
> > > on
> > > >>>>>>>>>>
> > > >>>>>>>>> ML
> > > >>>>>
> > > >>>>>> on
> > > >>>>>>>
> > > >>>>>>>> Flink, but nobody is sure how to do it.
> > > >>>>>>>>>> There are many directions we could take (batch, online,
> batch
> > on
> > > >>>>>>>>>> streaming), each with its own merits and downsides.
> > > >>>>>>>>>>
> > > >>>>>>>>>> If you want we can start a design doc and move the
> > conversation
> > > >>>>>>>>>>
> > > >>>>>>>>> there,
> > > >>>>>>
> > > >>>>>>> come
> > > >>>>>>>>>> up with a roadmap and start implementing.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Regards,
> > > >>>>>>>>>> Theodore
> > > >>>>>>>>>>
> > > >>>>>>>>>> [1]
> > > >>>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.
> > > >>>>>>>>>> nabble.com/Understanding-connected-streams-use-without-
> times
> > > >>>>>>>>>> tamps-td10241.html
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
> > > >>>>>>>>>>
> > > >>>>>>>>> mail@gaborhermann.com
> > > >>>>>>
> > > >>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> It's great to see so much activity in this discussion :)
> > > >>>>>>>>>>
> > > >>>>>>>>>>> I'll try to add my thoughts.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I think building a developer community (Till's 2. point)
> can
> > be
> > > >>>>>>>>>>>
> > > >>>>>>>>>> slightly
> > > >>>>>>>
> > > >>>>>>>> separated from what features we should aim for (1. point) and
> > > >>>>>>>>>>>
> > > >>>>>>>>>> showcasing
> > > >>>>>>>
> > > >>>>>>>> (3. point). Thanks Till for bringing up the ideas for
> > > >>>>>>>>>>>
> > > >>>>>>>>>> restructuring,
> > > >>>>>
> > > >>>>>> I'm
> > > >>>>>>>
> > > >>>>>>>> sure we'll find a way to make the development process more
> > > >>>>>>>>>>>
> > > >>>>>>>>>> dynamic.
> > > >>>>>
> > > >>>>>> I'll
> > > >>>>>>>
> > > >>>>>>>> try to address the rest here.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> It's hard to choose directions between streaming and batch
> > ML.
> > > As
> > > >>>>>>>>>>>
> > > >>>>>>>>>> Theo
> > > >>>>>>
> > > >>>>>>> has
> > > >>>>>>>>>>> indicated, not much online ML is used in production, but
> > Flink
> > > >>>>>>>>>>> concentrates
> > > >>>>>>>>>>> on streaming, so online ML would be a better fit for Flink.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> However,
> > > >>>>>
> > > >>>>>> as
> > > >>>>>>>
> > > >>>>>>>> most of you argued, there's definite need for batch ML. But
> > batch
> > > >>>>>>>>>>>
> > > >>>>>>>>>> ML
> > > >>>>>
> > > >>>>>> seems
> > > >>>>>>>>>>> hard to achieve because there are blocking issues with
> > > >>>>>>>>>>> persisting,
> > > >>>>>>>>>>> iteration paths etc. So it's no good either way.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I propose a seemingly crazy solution: what if we developed
> > > batch
> > > >>>>>>>>>>> algorithms also with the streaming API? The batch API would
> > > >>>>>>>>>>>
> > > >>>>>>>>>> clearly
> > > >>>>>
> > > >>>>>> seem
> > > >>>>>>>
> > > >>>>>>>> more suitable for ML algorithms, but there a lot of benefits
> of
> > > >>>>>>>>>>>
> > > >>>>>>>>>> this
> > > >>>>>
> > > >>>>>> approach too, so it's clearly worth considering. Flink also has
> > > >>>>>>>>>>>
> > > >>>>>>>>>> the
> > > >>>>>
> > > >>>>>> high
> > > >>>>>>>
> > > >>>>>>>> level vision of "streaming for everything" that would clearly
> > fit
> > > >>>>>>>>>>>
> > > >>>>>>>>>> this
> > > >>>>>>
> > > >>>>>>> case. What do you all think about this? Do you think this
> > solution
> > > >>>>>>>>>>>
> > > >>>>>>>>>> would
> > > >>>>>>>
> > > >>>>>>>> be
> > > >>>>>>>>>>> feasible? I would be happy to make a more elaborate
> proposal,
> > > but
> > > >>>>>>>>>>>
> > > >>>>>>>>>> I
> > > >>>>>
> > > >>>>>> push
> > > >>>>>>>
> > > >>>>>>>> my
> > > >>>>>>>>>>> main ideas here:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 1) Simplifying by using one system
> > > >>>>>>>>>>> It could simplify the work of both the users and the
> > > developers.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> One
> > > >>>>>
> > > >>>>>> could
> > > >>>>>>>>>>> execute training once, or could execute it periodically
> e.g.
> > by
> > > >>>>>>>>>>>
> > > >>>>>>>>>> using
> > > >>>>>>
> > > >>>>>>> windows. Low-latency serving and training could be done in the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> same
> > > >>>>>
> > > >>>>>> system.
> > > >>>>>>>>>>> We could implement incremental algorithms, without any side
> > > >>>>>>>>>>> inputs
> > > >>>>>>>>>>>
> > > >>>>>>>>>> for
> > > >>>>>>
> > > >>>>>>> combining online learning (or predictions) with batch learning.
> > Of
> > > >>>>>>>>>>> course,
> > > >>>>>>>>>>> all the logic describing these must be somehow implemented
> > > (e.g.
> > > >>>>>>>>>>> synchronizing predictions with training), but it should be
> > > easier
> > > >>>>>>>>>>>
> > > >>>>>>>>>> to
> > > >>>>>
> > > >>>>>> do
> > > >>>>>>>
> > > >>>>>>>> so
> > > >>>>>>>>>>> in one system, than by combining e.g. the batch and
> streaming
> > > >>>>>>>>>>> API.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 2) Batch ML with the streaming API is not harder
> > > >>>>>>>>>>> Despite these benefits, it could seem harder to implement
> > batch
> > > >>>>>>>>>>> ML
> > > >>>>>>>>>>>
> > > >>>>>>>>>> with
> > > >>>>>>>
> > > >>>>>>>> the streaming API, but in my opinion it's not. There are more
> > > >>>>>>>>>>>
> > > >>>>>>>>>> flexible,
> > > >>>>>>>
> > > >>>>>>>> lower-level optimization potentials with the streaming API.
> Most
> > > >>>>>>>>>>> distributed ML algorithms use a lower-level model than the
> > > batch
> > > >>>>>>>>>>>
> > > >>>>>>>>>> API
> > > >>>>>
> > > >>>>>> anyway, so sometimes it feels like forcing the algorithm logic
> > > >>>>>>>>>>>
> > > >>>>>>>>>> into
> > > >>>>>
> > > >>>>>> the
> > > >>>>>>>
> > > >>>>>>>> training API and tweaking it. Although we could not use the
> > batch
> > > >>>>>>>>>>> primitives like join, we would have the E.g. in my
> experience
> > > >>>>>>>>>>> with
> > > >>>>>>>>>>> implementing a distributed matrix factorization algorithm
> > [1],
> > > I
> > > >>>>>>>>>>>
> > > >>>>>>>>>> couldn't
> > > >>>>>>>
> > > >>>>>>>> do a simple optimization because of the limitations of the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> iteration
> > > >>>>>
> > > >>>>>> API
> > > >>>>>>>
> > > >>>>>>>> [2]. Even if we pushed all the development effort to make the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> batch
> > > >>>>>
> > > >>>>>> API
> > > >>>>>>>
> > > >>>>>>>> more suitable for ML there would be things we couldn't do.
> E.g.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> there
> > > >>>>>>
> > > >>>>>>> are
> > > >>>>>>>
> > > >>>>>>>> approaches for updating a model iteratively without locks
> [3,4]
> > > >>>>>>>>>>>
> > > >>>>>>>>>> (i.e.
> > > >>>>>>
> > > >>>>>>> somewhat asynchronously), and I don't see a clear way to
> > implement
> > > >>>>>>>>>>>
> > > >>>>>>>>>> such
> > > >>>>>>>
> > > >>>>>>>> algorithms with the batch API.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 3) Streaming community (users and devs) benefit
> > > >>>>>>>>>>> The Flink streaming community in general would also benefit
> > > from
> > > >>>>>>>>>>>
> > > >>>>>>>>>> this
> > > >>>>>>
> > > >>>>>>> direction. There are many features needed in the streaming API
> > for
> > > >>>>>>>>>>>
> > > >>>>>>>>>> ML
> > > >>>>>>
> > > >>>>>>> to
> > > >>>>>>>
> > > >>>>>>>> work, but this is also true for the batch API. One really
> > > >>>>>>>>>>>
> > > >>>>>>>>>> important
> > > >>>>>
> > > >>>>>> is
> > > >>>>>>
> > > >>>>>>> the
> > > >>>>>>>>>>> loops API (a.k.a. iterative DataStreams) [5]. There has
> been
> > a
> > > >>>>>>>>>>> lot
> > > >>>>>>>>>>>
> > > >>>>>>>>>> of
> > > >>>>>>
> > > >>>>>>> effort (mostly from Paris) for making it mature enough [6].
> Kate
> > > >>>>>>>>>>> mentioned
> > > >>>>>>>>>>> using GPUs, and I'm sure they have uses in streaming
> > generally
> > > >>>>>>>>>>>
> > > >>>>>>>>>> [7].
> > > >>>>>
> > > >>>>>> Thus,
> > > >>>>>>>
> > > >>>>>>>> by improving the streaming API to allow ML algorithms, the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> streaming
> > > >>>>>
> > > >>>>>> API
> > > >>>>>>>
> > > >>>>>>>> benefit too (which is important as they have a lot more
> > production
> > > >>>>>>>>>>>
> > > >>>>>>>>>> users
> > > >>>>>>>
> > > >>>>>>>> than the batch API).
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 4) Performance can be at least as good
> > > >>>>>>>>>>> I believe the same performance could be achieved with the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> streaming
> > > >>>>>
> > > >>>>>> API
> > > >>>>>>>
> > > >>>>>>>> as
> > > >>>>>>>>>>> with the batch API. Streaming API is much closer to the
> > runtime
> > > >>>>>>>>>>>
> > > >>>>>>>>>> than
> > > >>>>>
> > > >>>>>> the
> > > >>>>>>>
> > > >>>>>>>> batch API. For corner-cases, with runtime-layer optimizations
> of
> > > >>>>>>>>>>>
> > > >>>>>>>>>> batch
> > > >>>>>>
> > > >>>>>>> API,
> > > >>>>>>>>>>> we could find a way to do the same (or similar)
> optimization
> > > for
> > > >>>>>>>>>>>
> > > >>>>>>>>>> the
> > > >>>>>
> > > >>>>>> streaming API (see my previous point). Such case could be using
> > > >>>>>>>>>>>
> > > >>>>>>>>>> managed
> > > >>>>>>>
> > > >>>>>>>> memory (and spilling to disk). There are also benefits by
> > default,
> > > >>>>>>>>>>>
> > > >>>>>>>>>> e.g.
> > > >>>>>>>
> > > >>>>>>>> we
> > > >>>>>>>>>>> would have a finer grained fault tolerance with the
> streaming
> > > >>>>>>>>>>> API.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 5) We could keep batch ML API
> > > >>>>>>>>>>> For the shorter term, we should not throw away all the
> > > algorithms
> > > >>>>>>>>>>> implemented with the batch API. By pushing forward the
> > > >>>>>>>>>>> development
> > > >>>>>>>>>>>
> > > >>>>>>>>>> with
> > > >>>>>>>
> > > >>>>>>>> side inputs we could make them usable with streaming API.
> Then,
> > if
> > > >>>>>>>>>>>
> > > >>>>>>>>>> the
> > > >>>>>>
> > > >>>>>>> library gains some popularity, we could replace the algorithms
> in
> > > >>>>>>>>>>>
> > > >>>>>>>>>> the
> > > >>>>>>
> > > >>>>>>> batch
> > > >>>>>>>>>>> API with streaming ones, to avoid the performance costs of
> > e.g.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> not
> > > >>>>>
> > > >>>>>> being
> > > >>>>>>>
> > > >>>>>>>> able to persist.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 6) General tools for implementing ML algorithms
> > > >>>>>>>>>>> Besides implementing algorithms one by one, we could give
> > more
> > > >>>>>>>>>>>
> > > >>>>>>>>>> general
> > > >>>>>>
> > > >>>>>>> tools for making it easier to implement algorithms. E.g.
> > parameter
> > > >>>>>>>>>>>
> > > >>>>>>>>>> server
> > > >>>>>>>
> > > >>>>>>>> [8,9]. Theo also mentioned in another thread that TensorFlow
> > has a
> > > >>>>>>>>>>> similar
> > > >>>>>>>>>>> model to Flink streaming, we could look into that too. I
> > think
> > > >>>>>>>>>>>
> > > >>>>>>>>>> often
> > > >>>>>
> > > >>>>>> when
> > > >>>>>>>
> > > >>>>>>>> deploying a production ML system, much more configuration and
> > > >>>>>>>>>>>
> > > >>>>>>>>>> tweaking
> > > >>>>>>
> > > >>>>>>> should be done than e.g. Spark MLlib allows. Why not allow
> that?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 7) Showcasing
> > > >>>>>>>>>>> Showcasing this could be easier. We could say that we're
> > doing
> > > >>>>>>>>>>>
> > > >>>>>>>>>> batch
> > > >>>>>
> > > >>>>>> ML
> > > >>>>>>>
> > > >>>>>>>> with a streaming API. That's interesting in its own. IMHO this
> > > >>>>>>>>>>> integration
> > > >>>>>>>>>>> is also a more approachable way towards end-to-end ML.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks for reading so far :)
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
> > > >>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
> > > >>>>>>>>>>> [3] https://people.eecs.berkeley.
> > edu/~brecht/papers/hogwildTR.
> > > pd
> > > >>>>>>>>>>> f
> > > >>>>>>>>>>> [4] https://www.usenix.org/system/
> > > files/conference/hotos13/hotos
> > > >>>>>>>>>>> 13-final77.pdf
> > > >>>>>>>>>>> [5] https://cwiki.apache.org/
> confluence/display/FLINK/FLIP-
> > 15+
> > > >>>>>>>>>>> Scoped+Loops+and+Job+Termination
> > > >>>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
> > > >>>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-
> > > sigmod16.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> pdf
> > > >>>>>
> > > >>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.
> pdf
> > > >>>>>>>>>>> [9] http://apache-flink-mailing-
> > list-archive.1008284.n3.nabble
> > > .
> > > >>>>>>>>>>> com/Using-QueryableState-inside-Flink-jobs-and-
> > > >>>>>>>>>>> Parameter-Server-implementation-td15880.html
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Cheers,
> > > >>>>>>>>>>> Gabor
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> --
> > > >>>>>>>
> > > >>>>>> *Yours faithfully, *
> > > >>>>>>
> > > >>>>>> *Kate Eri.*
> > > >>>>>>
> > > >>>>>>
> > > >>
> > > >
> > >
> > >
> > > --
> > > Roberto Bentivoglio
> > > CTO
> > > e. roberto.bentivoglio@radicalbit.io
> > > Radicalbit S.r.l.
> > > radicalbit.io
> > >
> >
>

Re: [DISCUSS] Flink ML roadmap

Posted by Stavros Kontopoulos <st...@gmail.com>.
Interesting  thanx @Roberto.  I see that only TAP Analytics Toolkit
supports streaming. I am not aware of its market share, anyone?

Best,
Stavros

On Fri, Mar 3, 2017 at 11:50 AM, Theodore Vasiloudis <
theodoros.vasiloudis@gmail.com> wrote:

> Thank you for the links Roberto I did not know that Beam was working on an
> ML abstraction as well. I'm sure we can learn from that.
>
> I'll start another thread today where we can discuss next steps and action
> points now that we have a few different paths to follow listed on the
> shared doc,
> since our deadline was today. We welcome further discussions of course.
>
> Regards,
> Theodore
>
> > >> Am 23.02.2017 um 15:48 schrieb Gábor Hermann <ma...@gaborhermann.com>:
> > >>>
> > >>> Okay, I've created a skeleton of the design doc for choosing a
> > direction:
> > >>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
> > >>> 49h3Ud06MIRhahtJ6dw/edit?usp=sharing
> > >>>
> > >>> Much of the pros/cons have already been discussed here, so I'll try
> to
> > >>> put there all the arguments mentioned in this thread. Feel free to
> put
> > >>> there more :)
> > >>>
> > >>> @Stavros: I agree we should take action fast. What about collecting
> our
> > >>> thoughts in the doc by around Tuesday next week (28. February)? Then
> > decide
> > >>> on the direction and design a roadmap by around Friday (3. March)? Is
> > that
> > >>> feasible, or should it take more time?
> > >>>
> > >>> I think it will be necessary to have a shepherd, or even better a
> > >>> committer, to be involved in at least reviewing and accepting the
> > roadmap.
> > >>> It would be best, if a committer coordinated all this.
> > >>> @Theodore: Would you like to do the coordination?
> > >>>
> > >>> Regarding the use-cases: I've seen some abstracts of talks at SF
> Flink
> > >>> Forward [1] that seem promising. There are companies already using
> > Flink
> > >>> for ML [2,3,4,5].
> > >>>
> > >>> [1] http://sf.flink-forward.org/program/sessions/
> > >>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-str
> > >>> eaming-vs-micro-batch-for-online-learning/
> > >>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-te
> > >>> nsorflow/
> > >>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-le
> > >>> arning-on-flink/
> > >>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learn
> > >>> ing-scenarios-with-flink/
> > >>>
> > >>> Cheers,
> > >>> Gabor
> > >>>
> > >>>
> > >>> On 2017-02-23 15:19, Katherin Eri wrote:
> > >>>
> > >>>> I have asked already some teams for useful cases, but all of them
> need
> > >>>> time
> > >>>> to think.
> > >>>> During analysis something will finally arise.
> > >>>> May be we can ask partners of Flink  for cases? Data Artisans got
> > >>>> results
> > >>>> of customers survey: [1], ML better support is wanted, so we could
> ask
> > >>>> what
> > >>>> exactly is necessary.
> > >>>>
> > >>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
> > >>>>
> > >>>> 23 февр. 2017 г. 4:32 PM пользователь "Stavros Kontopoulos" <
> > >>>> st.kontopoulos@gmail.com> написал:
> > >>>>
> > >>>> +100 for a design doc.
> > >>>>>
> > >>>>> Could we also set a roadmap after some time-boxed investigation
> > >>>>> captured in
> > >>>>> that document? We need action.
> > >>>>>
> > >>>>> Looking forward to work on this (whatever that might be) ;) Also
> are
> > >>>>> there
> > >>>>> any data supporting one direction or the other from a customer
> > >>>>> perspective?
> > >>>>> It would help to make more informed decisions.
> > >>>>>
> > >>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <
> > katherinmail@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>> Yes, ok.
> > >>>>>> let's start some design document, and write down there already
> > >>>>>> mentioned
> > >>>>>> ideas about: parameter server, about clipper and others. Would be
> > >>>>>> nice if
> > >>>>>> we will also map this approaches to cases.
> > >>>>>> Will work on it collaboratively on each topic, may be finally we
> > will
> > >>>>>>
> > >>>>> form
> > >>>>>
> > >>>>>> some picture, that could be agreed with committers.
> > >>>>>> @Gabor, could you please start such shared doc, as you have
> already
> > >>>>>>
> > >>>>> several
> > >>>>>
> > >>>>>> ideas proposed?
> > >>>>>>
> > >>>>>> чт, 23 февр. 2017, 15:06 Gábor Hermann <ma...@gaborhermann.com>:
> > >>>>>>
> > >>>>>> I agree, that it's better to go in one direction first, but I
> think
> > >>>>>>> online and offline with streaming API can go somewhat parallel
> > later.
> > >>>>>>>
> > >>>>>> We
> > >>>>>
> > >>>>>> could set a short-term goal, concentrate initially on one
> direction,
> > >>>>>>>
> > >>>>>> and
> > >>>>>
> > >>>>>> showcase that direction (e.g. in a blogpost). But first, we should
> > >>>>>>> list
> > >>>>>>> the pros/cons in a design doc as a minimum. Then make a decision
> > what
> > >>>>>>> direction to go. Would that be feasible?
> > >>>>>>>
> > >>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
> > >>>>>>>
> > >>>>>>> I'm not sure that this is feasible, doing all at the same time
> > could
> > >>>>>>>>
> > >>>>>>> mean
> > >>>>>>
> > >>>>>>> doing nothing((((
> > >>>>>>>> I'm just afraid, that words: we will work on streaming not on
> > >>>>>>>>
> > >>>>>>> batching,
> > >>>>>
> > >>>>>> we
> > >>>>>>>
> > >>>>>>>> have no commiter's time for this, mean that yes, we started work
> > on
> > >>>>>>>> FLINK-1730, but nobody will commit this work in the end, as it
> > >>>>>>>>
> > >>>>>>> already
> > >>>>>
> > >>>>>> was
> > >>>>>>>
> > >>>>>>>> with this ticket.
> > >>>>>>>>
> > >>>>>>>> 23 февр. 2017 г. 14:26 пользователь "Gábor Hermann" <
> > >>>>>>>>
> > >>>>>>> mail@gaborhermann.com>
> > >>>>>>>
> > >>>>>>>> написал:
> > >>>>>>>>
> > >>>>>>>> @Theodore: Great to hear you think the "batch on streaming"
> > approach
> > >>>>>>>>>
> > >>>>>>>> is
> > >>>>>>
> > >>>>>>> possible! Of course, we need to pay attention all the pitfalls
> > >>>>>>>>>
> > >>>>>>>> there,
> > >>>>>
> > >>>>>> if we
> > >>>>>>>
> > >>>>>>>> go that way.
> > >>>>>>>>>
> > >>>>>>>>> +1 for a design doc!
> > >>>>>>>>>
> > >>>>>>>>> I would add that it's possible to make efforts in all the three
> > >>>>>>>>>
> > >>>>>>>> directions
> > >>>>>>>
> > >>>>>>>> (i.e. batch, online, batch on streaming) at the same time.
> > Although,
> > >>>>>>>>>
> > >>>>>>>> it
> > >>>>>>
> > >>>>>>> might be worth to concentrate on one. E.g. it would not be so
> > useful
> > >>>>>>>>>
> > >>>>>>>> to
> > >>>>>>
> > >>>>>>> have the same batch algorithms with both the batch API and
> > streaming
> > >>>>>>>>>
> > >>>>>>>> API.
> > >>>>>>>
> > >>>>>>>> We can decide later.
> > >>>>>>>>>
> > >>>>>>>>> The design doc could be partitioned to these 3 directions, and
> we
> > >>>>>>>>>
> > >>>>>>>> can
> > >>>>>
> > >>>>>> collect there the pros/cons too. What do you think?
> > >>>>>>>>>
> > >>>>>>>>> Cheers,
> > >>>>>>>>> Gabor
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
> > >>>>>>>>>
> > >>>>>>>>> Hello all,
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> @Gabor, we have discussed the idea of using the streaming API
> to
> > >>>>>>>>>>
> > >>>>>>>>> write
> > >>>>>>
> > >>>>>>> all
> > >>>>>>>
> > >>>>>>>> of our ML algorithms with a couple of people offline,
> > >>>>>>>>>> and I think it might be possible and is generally worth a
> shot.
> > >>>>>>>>>> The
> > >>>>>>>>>> approach we would take would be close to Vowpal Wabbit, not
> > >>>>>>>>>> exactly
> > >>>>>>>>>> "online", but rather "fast-batch".
> > >>>>>>>>>>
> > >>>>>>>>>> There will be problems popping up again, even for very simple
> > >>>>>>>>>> algos
> > >>>>>>>>>>
> > >>>>>>>>> like
> > >>>>>>>
> > >>>>>>>> on
> > >>>>>>>>>> line linear regression with SGD [1], but hopefully fixing
> those
> > >>>>>>>>>>
> > >>>>>>>>> will
> > >>>>>
> > >>>>>> be
> > >>>>>>
> > >>>>>>> more aligned with the priorities of the community.
> > >>>>>>>>>>
> > >>>>>>>>>> @Katherin, my understanding is that given the limited
> resources,
> > >>>>>>>>>>
> > >>>>>>>>> there
> > >>>>>>
> > >>>>>>> is
> > >>>>>>>
> > >>>>>>>> no development effort focused on batch processing right now.
> > >>>>>>>>>>
> > >>>>>>>>>> So to summarize, it seems like there are people willing to
> work
> > on
> > >>>>>>>>>>
> > >>>>>>>>> ML
> > >>>>>
> > >>>>>> on
> > >>>>>>>
> > >>>>>>>> Flink, but nobody is sure how to do it.
> > >>>>>>>>>> There are many directions we could take (batch, online, batch
> on
> > >>>>>>>>>> streaming), each with its own merits and downsides.
> > >>>>>>>>>>
> > >>>>>>>>>> If you want we can start a design doc and move the
> conversation
> > >>>>>>>>>>
> > >>>>>>>>> there,
> > >>>>>>
> > >>>>>>> come
> > >>>>>>>>>> up with a roadmap and start implementing.
> > >>>>>>>>>>
> > >>>>>>>>>> Regards,
> > >>>>>>>>>> Theodore
> > >>>>>>>>>>
> > >>>>>>>>>> [1]
> > >>>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.
> > >>>>>>>>>> nabble.com/Understanding-connected-streams-use-without-times
> > >>>>>>>>>> tamps-td10241.html
> > >>>>>>>>>>
> > >>>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
> > >>>>>>>>>>
> > >>>>>>>>> mail@gaborhermann.com
> > >>>>>>
> > >>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> It's great to see so much activity in this discussion :)
> > >>>>>>>>>>
> > >>>>>>>>>>> I'll try to add my thoughts.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I think building a developer community (Till's 2. point) can
> be
> > >>>>>>>>>>>
> > >>>>>>>>>> slightly
> > >>>>>>>
> > >>>>>>>> separated from what features we should aim for (1. point) and
> > >>>>>>>>>>>
> > >>>>>>>>>> showcasing
> > >>>>>>>
> > >>>>>>>> (3. point). Thanks Till for bringing up the ideas for
> > >>>>>>>>>>>
> > >>>>>>>>>> restructuring,
> > >>>>>
> > >>>>>> I'm
> > >>>>>>>
> > >>>>>>>> sure we'll find a way to make the development process more
> > >>>>>>>>>>>
> > >>>>>>>>>> dynamic.
> > >>>>>
> > >>>>>> I'll
> > >>>>>>>
> > >>>>>>>> try to address the rest here.
> > >>>>>>>>>>>
> > >>>>>>>>>>> It's hard to choose directions between streaming and batch
> ML.
> > As
> > >>>>>>>>>>>
> > >>>>>>>>>> Theo
> > >>>>>>
> > >>>>>>> has
> > >>>>>>>>>>> indicated, not much online ML is used in production, but
> Flink
> > >>>>>>>>>>> concentrates
> > >>>>>>>>>>> on streaming, so online ML would be a better fit for Flink.
> > >>>>>>>>>>>
> > >>>>>>>>>> However,
> > >>>>>
> > >>>>>> as
> > >>>>>>>
> > >>>>>>>> most of you argued, there's definite need for batch ML. But
> batch
> > >>>>>>>>>>>
> > >>>>>>>>>> ML
> > >>>>>
> > >>>>>> seems
> > >>>>>>>>>>> hard to achieve because there are blocking issues with
> > >>>>>>>>>>> persisting,
> > >>>>>>>>>>> iteration paths etc. So it's no good either way.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I propose a seemingly crazy solution: what if we developed
> > batch
> > >>>>>>>>>>> algorithms also with the streaming API? The batch API would
> > >>>>>>>>>>>
> > >>>>>>>>>> clearly
> > >>>>>
> > >>>>>> seem
> > >>>>>>>
> > >>>>>>>> more suitable for ML algorithms, but there a lot of benefits of
> > >>>>>>>>>>>
> > >>>>>>>>>> this
> > >>>>>
> > >>>>>> approach too, so it's clearly worth considering. Flink also has
> > >>>>>>>>>>>
> > >>>>>>>>>> the
> > >>>>>
> > >>>>>> high
> > >>>>>>>
> > >>>>>>>> level vision of "streaming for everything" that would clearly
> fit
> > >>>>>>>>>>>
> > >>>>>>>>>> this
> > >>>>>>
> > >>>>>>> case. What do you all think about this? Do you think this
> solution
> > >>>>>>>>>>>
> > >>>>>>>>>> would
> > >>>>>>>
> > >>>>>>>> be
> > >>>>>>>>>>> feasible? I would be happy to make a more elaborate proposal,
> > but
> > >>>>>>>>>>>
> > >>>>>>>>>> I
> > >>>>>
> > >>>>>> push
> > >>>>>>>
> > >>>>>>>> my
> > >>>>>>>>>>> main ideas here:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1) Simplifying by using one system
> > >>>>>>>>>>> It could simplify the work of both the users and the
> > developers.
> > >>>>>>>>>>>
> > >>>>>>>>>> One
> > >>>>>
> > >>>>>> could
> > >>>>>>>>>>> execute training once, or could execute it periodically e.g.
> by
> > >>>>>>>>>>>
> > >>>>>>>>>> using
> > >>>>>>
> > >>>>>>> windows. Low-latency serving and training could be done in the
> > >>>>>>>>>>>
> > >>>>>>>>>> same
> > >>>>>
> > >>>>>> system.
> > >>>>>>>>>>> We could implement incremental algorithms, without any side
> > >>>>>>>>>>> inputs
> > >>>>>>>>>>>
> > >>>>>>>>>> for
> > >>>>>>
> > >>>>>>> combining online learning (or predictions) with batch learning.
> Of
> > >>>>>>>>>>> course,
> > >>>>>>>>>>> all the logic describing these must be somehow implemented
> > (e.g.
> > >>>>>>>>>>> synchronizing predictions with training), but it should be
> > easier
> > >>>>>>>>>>>
> > >>>>>>>>>> to
> > >>>>>
> > >>>>>> do
> > >>>>>>>
> > >>>>>>>> so
> > >>>>>>>>>>> in one system, than by combining e.g. the batch and streaming
> > >>>>>>>>>>> API.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2) Batch ML with the streaming API is not harder
> > >>>>>>>>>>> Despite these benefits, it could seem harder to implement
> batch
> > >>>>>>>>>>> ML
> > >>>>>>>>>>>
> > >>>>>>>>>> with
> > >>>>>>>
> > >>>>>>>> the streaming API, but in my opinion it's not. There are more
> > >>>>>>>>>>>
> > >>>>>>>>>> flexible,
> > >>>>>>>
> > >>>>>>>> lower-level optimization potentials with the streaming API. Most
> > >>>>>>>>>>> distributed ML algorithms use a lower-level model than the
> > batch
> > >>>>>>>>>>>
> > >>>>>>>>>> API
> > >>>>>
> > >>>>>> anyway, so sometimes it feels like forcing the algorithm logic
> > >>>>>>>>>>>
> > >>>>>>>>>> into
> > >>>>>
> > >>>>>> the
> > >>>>>>>
> > >>>>>>>> training API and tweaking it. Although we could not use the
> batch
> > >>>>>>>>>>> primitives like join, we would have the E.g. in my experience
> > >>>>>>>>>>> with
> > >>>>>>>>>>> implementing a distributed matrix factorization algorithm
> [1],
> > I
> > >>>>>>>>>>>
> > >>>>>>>>>> couldn't
> > >>>>>>>
> > >>>>>>>> do a simple optimization because of the limitations of the
> > >>>>>>>>>>>
> > >>>>>>>>>> iteration
> > >>>>>
> > >>>>>> API
> > >>>>>>>
> > >>>>>>>> [2]. Even if we pushed all the development effort to make the
> > >>>>>>>>>>>
> > >>>>>>>>>> batch
> > >>>>>
> > >>>>>> API
> > >>>>>>>
> > >>>>>>>> more suitable for ML there would be things we couldn't do. E.g.
> > >>>>>>>>>>>
> > >>>>>>>>>> there
> > >>>>>>
> > >>>>>>> are
> > >>>>>>>
> > >>>>>>>> approaches for updating a model iteratively without locks [3,4]
> > >>>>>>>>>>>
> > >>>>>>>>>> (i.e.
> > >>>>>>
> > >>>>>>> somewhat asynchronously), and I don't see a clear way to
> implement
> > >>>>>>>>>>>
> > >>>>>>>>>> such
> > >>>>>>>
> > >>>>>>>> algorithms with the batch API.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 3) Streaming community (users and devs) benefit
> > >>>>>>>>>>> The Flink streaming community in general would also benefit
> > from
> > >>>>>>>>>>>
> > >>>>>>>>>> this
> > >>>>>>
> > >>>>>>> direction. There are many features needed in the streaming API
> for
> > >>>>>>>>>>>
> > >>>>>>>>>> ML
> > >>>>>>
> > >>>>>>> to
> > >>>>>>>
> > >>>>>>>> work, but this is also true for the batch API. One really
> > >>>>>>>>>>>
> > >>>>>>>>>> important
> > >>>>>
> > >>>>>> is
> > >>>>>>
> > >>>>>>> the
> > >>>>>>>>>>> loops API (a.k.a. iterative DataStreams) [5]. There has been
> a
> > >>>>>>>>>>> lot
> > >>>>>>>>>>>
> > >>>>>>>>>> of
> > >>>>>>
> > >>>>>>> effort (mostly from Paris) for making it mature enough [6]. Kate
> > >>>>>>>>>>> mentioned
> > >>>>>>>>>>> using GPUs, and I'm sure they have uses in streaming
> generally
> > >>>>>>>>>>>
> > >>>>>>>>>> [7].
> > >>>>>
> > >>>>>> Thus,
> > >>>>>>>
> > >>>>>>>> by improving the streaming API to allow ML algorithms, the
> > >>>>>>>>>>>
> > >>>>>>>>>> streaming
> > >>>>>
> > >>>>>> API
> > >>>>>>>
> > >>>>>>>> benefit too (which is important as they have a lot more
> production
> > >>>>>>>>>>>
> > >>>>>>>>>> users
> > >>>>>>>
> > >>>>>>>> than the batch API).
> > >>>>>>>>>>>
> > >>>>>>>>>>> 4) Performance can be at least as good
> > >>>>>>>>>>> I believe the same performance could be achieved with the
> > >>>>>>>>>>>
> > >>>>>>>>>> streaming
> > >>>>>
> > >>>>>> API
> > >>>>>>>
> > >>>>>>>> as
> > >>>>>>>>>>> with the batch API. Streaming API is much closer to the
> runtime
> > >>>>>>>>>>>
> > >>>>>>>>>> than
> > >>>>>
> > >>>>>> the
> > >>>>>>>
> > >>>>>>>> batch API. For corner-cases, with runtime-layer optimizations of
> > >>>>>>>>>>>
> > >>>>>>>>>> batch
> > >>>>>>
> > >>>>>>> API,
> > >>>>>>>>>>> we could find a way to do the same (or similar) optimization
> > for
> > >>>>>>>>>>>
> > >>>>>>>>>> the
> > >>>>>
> > >>>>>> streaming API (see my previous point). Such case could be using
> > >>>>>>>>>>>
> > >>>>>>>>>> managed
> > >>>>>>>
> > >>>>>>>> memory (and spilling to disk). There are also benefits by
> default,
> > >>>>>>>>>>>
> > >>>>>>>>>> e.g.
> > >>>>>>>
> > >>>>>>>> we
> > >>>>>>>>>>> would have a finer grained fault tolerance with the streaming
> > >>>>>>>>>>> API.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 5) We could keep batch ML API
> > >>>>>>>>>>> For the shorter term, we should not throw away all the
> > algorithms
> > >>>>>>>>>>> implemented with the batch API. By pushing forward the
> > >>>>>>>>>>> development
> > >>>>>>>>>>>
> > >>>>>>>>>> with
> > >>>>>>>
> > >>>>>>>> side inputs we could make them usable with streaming API. Then,
> if
> > >>>>>>>>>>>
> > >>>>>>>>>> the
> > >>>>>>
> > >>>>>>> library gains some popularity, we could replace the algorithms in
> > >>>>>>>>>>>
> > >>>>>>>>>> the
> > >>>>>>
> > >>>>>>> batch
> > >>>>>>>>>>> API with streaming ones, to avoid the performance costs of
> e.g.
> > >>>>>>>>>>>
> > >>>>>>>>>> not
> > >>>>>
> > >>>>>> being
> > >>>>>>>
> > >>>>>>>> able to persist.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 6) General tools for implementing ML algorithms
> > >>>>>>>>>>> Besides implementing algorithms one by one, we could give
> more
> > >>>>>>>>>>>
> > >>>>>>>>>> general
> > >>>>>>
> > >>>>>>> tools for making it easier to implement algorithms. E.g.
> parameter
> > >>>>>>>>>>>
> > >>>>>>>>>> server
> > >>>>>>>
> > >>>>>>>> [8,9]. Theo also mentioned in another thread that TensorFlow
> has a
> > >>>>>>>>>>> similar
> > >>>>>>>>>>> model to Flink streaming, we could look into that too. I
> think
> > >>>>>>>>>>>
> > >>>>>>>>>> often
> > >>>>>
> > >>>>>> when
> > >>>>>>>
> > >>>>>>>> deploying a production ML system, much more configuration and
> > >>>>>>>>>>>
> > >>>>>>>>>> tweaking
> > >>>>>>
> > >>>>>>> should be done than e.g. Spark MLlib allows. Why not allow that?
> > >>>>>>>>>>>
> > >>>>>>>>>>> 7) Showcasing
> > >>>>>>>>>>> Showcasing this could be easier. We could say that we're
> doing
> > >>>>>>>>>>>
> > >>>>>>>>>> batch
> > >>>>>
> > >>>>>> ML
> > >>>>>>>
> > >>>>>>>> with a streaming API. That's interesting in its own. IMHO this
> > >>>>>>>>>>> integration
> > >>>>>>>>>>> is also a more approachable way towards end-to-end ML.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for reading so far :)
> > >>>>>>>>>>>
> > >>>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
> > >>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
> > >>>>>>>>>>> [3] https://people.eecs.berkeley.
> edu/~brecht/papers/hogwildTR.
> > pd
> > >>>>>>>>>>> f
> > >>>>>>>>>>> [4] https://www.usenix.org/system/
> > files/conference/hotos13/hotos
> > >>>>>>>>>>> 13-final77.pdf
> > >>>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-
> 15+
> > >>>>>>>>>>> Scoped+Loops+and+Job+Termination
> > >>>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
> > >>>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-
> > sigmod16.
> > >>>>>>>>>>>
> > >>>>>>>>>> pdf
> > >>>>>
> > >>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
> > >>>>>>>>>>> [9] http://apache-flink-mailing-
> list-archive.1008284.n3.nabble
> > .
> > >>>>>>>>>>> com/Using-QueryableState-inside-Flink-jobs-and-
> > >>>>>>>>>>> Parameter-Server-implementation-td15880.html
> > >>>>>>>>>>>
> > >>>>>>>>>>> Cheers,
> > >>>>>>>>>>> Gabor
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>
> > >>>>>> *Yours faithfully, *
> > >>>>>>
> > >>>>>> *Kate Eri.*
> > >>>>>>
> > >>>>>>
> > >>
> > >
> >
> >
> > --
> > Roberto Bentivoglio
> > CTO
> > e. roberto.bentivoglio@radicalbit.io
> > Radicalbit S.r.l.
> > radicalbit.io
> >
>

Re: [DISCUSS] Flink ML roadmap

Posted by Theodore Vasiloudis <th...@gmail.com>.
Thank you for the links, Roberto. I did not know that Beam was working on an
ML abstraction as well. I'm sure we can learn from that.

Since our deadline was today, I'll start another thread where we can discuss
next steps and action points, now that we have a few different paths to
follow listed on the shared doc. Further discussion is of course welcome.

Regards,
Theodore

On Thu, Mar 2, 2017 at 10:52 AM, Roberto Bentivoglio <
roberto.bentivoglio@radicalbit.io> wrote:

> Hi All,
>
> First of all I'd like to introduce myself: my name is Roberto Bentivoglio
> and I'm currently working at Radicalbit, like Andrea Spina (who already
> wrote in this thread).
> I haven't had the chance to contribute directly to Flink so far, but some
> colleagues of mine have been doing so for at least a year (they have also
> contributed to the machine learning library).
>
> I hope I'm not jumping into the discussion too late; it's really interesting,
> and the analysis document depicts the currently available scenarios really
> well. Many thanks for your effort!
>
> If I can add my two cents to the discussion, I'd like to add the following:
>  - it's clear that the Flink community is currently focused more on streaming
> features than on batch features. For this reason I think that implementing
> "Offline learning with Streaming API" is really a great idea.
>  - I think that the "Online learning" option is really a good fit for
> Flink, but maybe we could initially give a higher priority to the
> "Offline learning with Streaming API" option. Still, I think the online
> option will be the main goal for the mid/long term.
>  - we implemented a library based on jpmml-evaluator [1] and Flink called
> "flink-jpmml". Using this library you can train models on external
> systems and, after exporting them in the PMML standard format, use them to
> run evaluations on top of the DataStream API. We haven't open sourced this
> library yet, but we're planning to do so in the next few weeks. We'd like to
> complete the documentation and the final code reviews before sharing it. I
> hope it will help the community enhance the ML support in Flink.
>  - I'd also like to mention that the Apache Beam community is thinking about
> an ML DSL. There is a design document and a couple of Jira tasks for that
> [2][3]
>
> At Radicalbit we're really keen to focus our effort on improving the ML
> support in Flink, and our team will contribute to this effort on a regular
> basis.
>
> Looking forward to working with you!
>
> Many thanks,
> Roberto
>
> [1] - https://github.com/jpmml/jpmml-evaluator
> [2] -
> https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PB
> ECHb-xA
> [3] - https://issues.apache.org/jira/browse/BEAM-303
>
> On 28 February 2017 at 19:35, Gábor Hermann <ma...@gaborhermann.com> wrote:
>
> > Hi Philipp,
> >
> > It's great to hear you are interested in Flink ML!
> >
> > Based on your description, your prototype seems like an interesting
> > approach for combining online+offline learning. If you're interested, we
> > might find a way to integrate your work, or at least your ideas, into
> Flink
> > ML if we decide on a direction that fits your approach. I think your work
> > could be relevant for almost all the directions listed there (if I
> > understand correctly you'd even like to serve predictions on unlabeled
> > data).
> >
> > Feel free to join the discussion in the docs you've mentioned :)
> >
> > Cheers,
> > Gabor
> >
> >
> > On 2017-02-27 18:39, Philipp Zehnder wrote:
> >
> > Hello all,
> >>
> >> I’m new to this mailing list and I wanted to introduce myself. My name
> is
> >> Philipp Zehnder and I’m a Masters Student in Computer Science at the
> >> Karlsruhe Institute of Technology in Germany currently writing on my
> >> master’s thesis with the main goal to integrate reusable machine
> learning
> >> components into a stream processing network. One part of my thesis is to
> >> create an API for distributed online machine learning.
> >>
> >> I saw that there are some recent discussions how to continue the
> >> development of Flink ML [1] and I want to share some of my experiences
> and
> >> maybe get some feedback from the community for my ideas.
> >>
> >> As I am new to open source projects I hope this is the right place for
> >> this.
> >>
> >> In the beginning, I had a look at different already existing frameworks
> >> like Apache SAMOA for example, which is great and has a lot of useful
> >> resources. However, as Flink is currently focusing on streaming, from my
> >> point of view it makes sense to also have a streaming machine learning
> API
> >> as part of the Flink ecosystem.
> >>
> >> I’m currently working on building a prototype for a distributed
> streaming
> >> machine learning library based on Flink that can be used for online and
> >> “classical” offline learning.
> >>
> >> The machine learning algorithm takes labeled and non-labeled data. On a
> >> labeled data point first a prediction is performed and then this label
> is
> >> used to train the model. On a non-labeled data point just a prediction
> is
> >> performed. The main difference between the online and offline
> algorithms is
> >> that in the offline case the labeled data must be handed to the model
> >> before the unlabeled data. In the online case, it is still possible to
> >> process labeled data at a later point to update the model. The
> advantage of
> >> this approach is that batch algorithms can be applied on streaming data
> as
> >> well as online algorithms can be supported.
> >>
> >> One difference to batch learning are the transformers that are used to
> >> preprocess the data. For example, a simple mean subtraction must be
> >> implemented with a rolling mean, because we can’t calculate the mean
> over
> >> all the data, but the Flink Streaming API is perfect for that. It would
> be
> >> useful for users to have an extensible toolbox of transformers.
> >>
> >> Another difference is the evaluation of the models. As we don’t have a
> >> single value to determine the model quality, in streaming scenarios this
> >> value evolves over time when it sees more labeled data.
> >>
> >> However, the transformation and evaluation works again similar in both
> >> online learning and offline learning.
> >>
> >> I also liked the discussion in [2] and I think that the competition in
> >> the batch learning field is hard and there are already a lot of great
> >> projects. I think it is true that in most real world problems it is not
> >> necessary to update the model immediately, but there are a lot of use
> cases
> >> for machine learning on streams. For them it would be nice to have a
> native
> >> approach.
> >>
> >> A stream machine learning API for Flink would fit very well and I would
> >> also be willing to contribute to the future development of the Flink ML
> >> library.
> >>
> >>
> >>
> >> Best regards,
> >>
> >> Philipp
> >>
> >> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> >> com/DISCUSS-Flink-ML-roadmap-td16040.html <
> http://apache-flink-mailing-l
> >> ist-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-td16040.html
> >
> >> [2] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
> >> 49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2 <
> >> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQ
> >> c49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2>
> >>
> >>
> >> Am 23.02.2017 um 15:48 schrieb Gábor Hermann <ma...@gaborhermann.com>:
> >>>
> >>> Okay, I've created a skeleton of the design doc for choosing a
> direction:
> >>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
> >>> 49h3Ud06MIRhahtJ6dw/edit?usp=sharing
> >>>
> >>> Much of the pros/cons have already been discussed here, so I'll try to
> >>> put there all the arguments mentioned in this thread. Feel free to put
> >>> there more :)
> >>>
> >>> @Stavros: I agree we should take action fast. What about collecting our
> >>> thoughts in the doc by around Tuesday next week (28. February)? Then
> decide
> >>> on the direction and design a roadmap by around Friday (3. March)? Is
> that
> >>> feasible, or should it take more time?
> >>>
> >>> I think it will be necessary to have a shepherd, or even better a
> >>> committer, to be involved in at least reviewing and accepting the
> roadmap.
> >>> It would be best, if a committer coordinated all this.
> >>> @Theodore: Would you like to do the coordination?
> >>>
> >>> Regarding the use-cases: I've seen some abstracts of talks at SF Flink
> >>> Forward [1] that seem promising. There are companies already using
> Flink
> >>> for ML [2,3,4,5].
> >>>
> >>> [1] http://sf.flink-forward.org/program/sessions/
> >>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-str
> >>> eaming-vs-micro-batch-for-online-learning/
> >>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-te
> >>> nsorflow/
> >>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-le
> >>> arning-on-flink/
> >>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learn
> >>> ing-scenarios-with-flink/
> >>>
> >>> Cheers,
> >>> Gabor
> >>>
> >>>
> >>> On 2017-02-23 15:19, Katherin Eri wrote:
> >>>
> >>>> I have asked already some teams for useful cases, but all of them need
> >>>> time
> >>>> to think.
> >>>> During analysis something will finally arise.
> >>>> May be we can ask partners of Flink  for cases? Data Artisans got
> >>>> results
> >>>> of customers survey: [1], ML better support is wanted, so we could ask
> >>>> what
> >>>> exactly is necessary.
> >>>>
> >>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
> >>>>
> >>>> 23 февр. 2017 г. 4:32 PM пользователь "Stavros Kontopoulos" <
> >>>> st.kontopoulos@gmail.com> написал:
> >>>>
> >>>> +100 for a design doc.
> >>>>>
> >>>>> Could we also set a roadmap after some time-boxed investigation
> >>>>> captured in
> >>>>> that document? We need action.
> >>>>>
> >>>>> Looking forward to work on this (whatever that might be) ;) Also are
> >>>>> there
> >>>>> any data supporting one direction or the other from a customer
> >>>>> perspective?
> >>>>> It would help to make more informed decisions.
> >>>>>
> >>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <
> katherinmail@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>> Yes, ok.
> >>>>>> let's start some design document, and write down there already
> >>>>>> mentioned
> >>>>>> ideas about: parameter server, about clipper and others. Would be
> >>>>>> nice if
> >>>>>> we will also map this approaches to cases.
> >>>>>> Will work on it collaboratively on each topic, may be finally we
> will
> >>>>>>
> >>>>> form
> >>>>>
> >>>>>> some picture, that could be agreed with committers.
> >>>>>> @Gabor, could you please start such shared doc, as you have already
> >>>>>>
> >>>>> several
> >>>>>
> >>>>>> ideas proposed?
> >>>>>>
> >>>>>> чт, 23 февр. 2017, 15:06 Gábor Hermann <ma...@gaborhermann.com>:
> >>>>>>
> >>>>>> I agree, that it's better to go in one direction first, but I think
> >>>>>>> online and offline with streaming API can go somewhat parallel
> later.
> >>>>>>>
> >>>>>> We
> >>>>>
> >>>>>> could set a short-term goal, concentrate initially on one direction,
> >>>>>>>
> >>>>>> and
> >>>>>
> >>>>>> showcase that direction (e.g. in a blogpost). But first, we should
> >>>>>>> list
> >>>>>>> the pros/cons in a design doc as a minimum. Then make a decision
> what
> >>>>>>> direction to go. Would that be feasible?
> >>>>>>>
> >>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
> >>>>>>>
> >>>>>>> I'm not sure that this is feasible, doing all at the same time
> could
> >>>>>>>>
> >>>>>>> mean
> >>>>>>
> >>>>>>> doing nothing((((
> >>>>>>>> I'm just afraid, that words: we will work on streaming not on
> >>>>>>>>
> >>>>>>> batching,
> >>>>>
> >>>>>> we
> >>>>>>>
> >>>>>>>> have no commiter's time for this, mean that yes, we started work
> on
> >>>>>>>> FLINK-1730, but nobody will commit this work in the end, as it
> >>>>>>>>
> >>>>>>> already
> >>>>>
> >>>>>> was
> >>>>>>>
> >>>>>>>> with this ticket.
> >>>>>>>>
> >>>>>>>> 23 февр. 2017 г. 14:26 пользователь "Gábor Hermann" <
> >>>>>>>>
> >>>>>>> mail@gaborhermann.com>
> >>>>>>>
> >>>>>>>> написал:
> >>>>>>>>
> >>>>>>>> @Theodore: Great to hear you think the "batch on streaming"
> approach
> >>>>>>>>>
> >>>>>>>> is
> >>>>>>
> >>>>>>> possible! Of course, we need to pay attention all the pitfalls
> >>>>>>>>>
> >>>>>>>> there,
> >>>>>
> >>>>>> if we
> >>>>>>>
> >>>>>>>> go that way.
> >>>>>>>>>
> >>>>>>>>> +1 for a design doc!
> >>>>>>>>>
> >>>>>>>>> I would add that it's possible to make efforts in all the three
> >>>>>>>>>
> >>>>>>>> directions
> >>>>>>>
> >>>>>>>> (i.e. batch, online, batch on streaming) at the same time.
> Although,
> >>>>>>>>>
> >>>>>>>> it
> >>>>>>
> >>>>>>> might be worth to concentrate on one. E.g. it would not be so
> useful
> >>>>>>>>>
> >>>>>>>> to
> >>>>>>
> >>>>>>> have the same batch algorithms with both the batch API and
> streaming
> >>>>>>>>>
> >>>>>>>> API.
> >>>>>>>
> >>>>>>>> We can decide later.
> >>>>>>>>>
> >>>>>>>>> The design doc could be partitioned to these 3 directions, and we
> >>>>>>>>>
> >>>>>>>> can
> >>>>>
> >>>>>> collect there the pros/cons too. What do you think?
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Gabor
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
> >>>>>>>>>
> >>>>>>>>> Hello all,
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> @Gabor, we have discussed the idea of using the streaming API to
> >>>>>>>>>>
> >>>>>>>>> write
> >>>>>>
> >>>>>>> all
> >>>>>>>
> >>>>>>>> of our ML algorithms with a couple of people offline,
> >>>>>>>>>> and I think it might be possible and is generally worth a shot.
> >>>>>>>>>> The
> >>>>>>>>>> approach we would take would be close to Vowpal Wabbit, not
> >>>>>>>>>> exactly
> >>>>>>>>>> "online", but rather "fast-batch".
> >>>>>>>>>>
> >>>>>>>>>> There will be problems popping up again, even for very simple
> >>>>>>>>>> algos
> >>>>>>>>>>
> >>>>>>>>> like
> >>>>>>>
> >>>>>>>> on
> >>>>>>>>>> line linear regression with SGD [1], but hopefully fixing those
> >>>>>>>>>>
> >>>>>>>>> will
> >>>>>
> >>>>>> be
> >>>>>>
> >>>>>>> more aligned with the priorities of the community.
> >>>>>>>>>>
> >>>>>>>>>> @Katherin, my understanding is that given the limited resources,
> >>>>>>>>>>
> >>>>>>>>> there
> >>>>>>
> >>>>>>> is
> >>>>>>>
> >>>>>>>> no development effort focused on batch processing right now.
> >>>>>>>>>>
> >>>>>>>>>> So to summarize, it seems like there are people willing to work
> on
> >>>>>>>>>>
> >>>>>>>>> ML
> >>>>>
> >>>>>> on
> >>>>>>>
> >>>>>>>> Flink, but nobody is sure how to do it.
> >>>>>>>>>> There are many directions we could take (batch, online, batch on
> >>>>>>>>>> streaming), each with its own merits and downsides.
> >>>>>>>>>>
> >>>>>>>>>> If you want we can start a design doc and move the conversation
> >>>>>>>>>>
> >>>>>>>>> there,
> >>>>>>
> >>>>>>> come
> >>>>>>>>>> up with a roadmap and start implementing.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Theodore
> >>>>>>>>>>
> >>>>>>>>>> [1]
> >>>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.
> >>>>>>>>>> nabble.com/Understanding-connected-streams-use-without-times
> >>>>>>>>>> tamps-td10241.html
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
> >>>>>>>>>>
> >>>>>>>>> mail@gaborhermann.com
> >>>>>>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> It's great to see so much activity in this discussion :)
> >>>>>>>>>>
> >>>>>>>>>>> I'll try to add my thoughts.
> >>>>>>>>>>>
> >>>>>>>>>>> I think building a developer community (Till's 2. point) can be
> >>>>>>>>>>>
> >>>>>>>>>> slightly
> >>>>>>>
> >>>>>>>> separated from what features we should aim for (1. point) and
> >>>>>>>>>>>
> >>>>>>>>>> showcasing
> >>>>>>>
> >>>>>>>> (3. point). Thanks Till for bringing up the ideas for
> >>>>>>>>>>>
> >>>>>>>>>> restructuring,
> >>>>>
> >>>>>> I'm
> >>>>>>>
> >>>>>>>> sure we'll find a way to make the development process more
> >>>>>>>>>>>
> >>>>>>>>>> dynamic.
> >>>>>
> >>>>>> I'll
> >>>>>>>
> >>>>>>>> try to address the rest here.
> >>>>>>>>>>>
> >>>>>>>>>>> It's hard to choose directions between streaming and batch ML.
> As
> >>>>>>>>>>>
> >>>>>>>>>> Theo
> >>>>>>
> >>>>>>> has
> >>>>>>>>>>> indicated, not much online ML is used in production, but Flink
> >>>>>>>>>>> concentrates
> >>>>>>>>>>> on streaming, so online ML would be a better fit for Flink.
> >>>>>>>>>>>
> >>>>>>>>>> However,
> >>>>>
> >>>>>> as
> >>>>>>>
> >>>>>>>> most of you argued, there's definite need for batch ML. But batch
> >>>>>>>>>>>
> >>>>>>>>>> ML
> >>>>>
> >>>>>> seems
> >>>>>>>>>>> hard to achieve because there are blocking issues with
> >>>>>>>>>>> persisting,
> >>>>>>>>>>> iteration paths etc. So it's no good either way.
> >>>>>>>>>>>
> >>>>>>>>>>> I propose a seemingly crazy solution: what if we developed
> batch
> >>>>>>>>>>> algorithms also with the streaming API? The batch API would
> >>>>>>>>>>>
> >>>>>>>>>> clearly
> >>>>>
> >>>>>> seem
> >>>>>>>
> >>>>>>>> more suitable for ML algorithms, but there a lot of benefits of
> >>>>>>>>>>>
> >>>>>>>>>> this
> >>>>>
> >>>>>> approach too, so it's clearly worth considering. Flink also has
> >>>>>>>>>>>
> >>>>>>>>>> the
> >>>>>
> >>>>>> high
> >>>>>>>
> >>>>>>>> level vision of "streaming for everything" that would clearly fit
> >>>>>>>>>>>
> >>>>>>>>>> this
> >>>>>>
> >>>>>>> case. What do you all think about this? Do you think this solution
> >>>>>>>>>>>
> >>>>>>>>>> would
> >>>>>>>
> >>>>>>>> be
> >>>>>>>>>>> feasible? I would be happy to make a more elaborate proposal,
> but
> >>>>>>>>>>>
> >>>>>>>>>> I
> >>>>>
> >>>>>> push
> >>>>>>>
> >>>>>>>> my
> >>>>>>>>>>> main ideas here:
> >>>>>>>>>>>
> >>>>>>>>>>> 1) Simplifying by using one system
> >>>>>>>>>>> It could simplify the work of both the users and the
> developers.
> >>>>>>>>>>>
> >>>>>>>>>> One
> >>>>>
> >>>>>> could
> >>>>>>>>>>> execute training once, or could execute it periodically e.g. by
> >>>>>>>>>>>
> >>>>>>>>>> using
> >>>>>>
> >>>>>>> windows. Low-latency serving and training could be done in the
> >>>>>>>>>>>
> >>>>>>>>>> same
> >>>>>
> >>>>>> system.
> >>>>>>>>>>> We could implement incremental algorithms, without any side
> >>>>>>>>>>> inputs
> >>>>>>>>>>>
> >>>>>>>>>> for
> >>>>>>
> >>>>>>> combining online learning (or predictions) with batch learning. Of
> >>>>>>>>>>> course,
> >>>>>>>>>>> all the logic describing these must be somehow implemented
> (e.g.
> >>>>>>>>>>> synchronizing predictions with training), but it should be
> easier
> >>>>>>>>>>>
> >>>>>>>>>> to
> >>>>>
> >>>>>> do
> >>>>>>>
> >>>>>>>> so
> >>>>>>>>>>> in one system, than by combining e.g. the batch and streaming
> >>>>>>>>>>> API.
> >>>>>>>>>>>
> >>>>>>>>>>> 2) Batch ML with the streaming API is not harder
> >>>>>>>>>>> Despite these benefits, it could seem harder to implement batch
> >>>>>>>>>>> ML
> >>>>>>>>>>>
> >>>>>>>>>> with
> >>>>>>>
> >>>>>>>> the streaming API, but in my opinion it's not. There are more
> >>>>>>>>>>>
> >>>>>>>>>> flexible,
> >>>>>>>
> >>>>>>>> lower-level optimization potentials with the streaming API. Most
> >>>>>>>>>>> distributed ML algorithms use a lower-level model than the
> batch
> >>>>>>>>>>>
> >>>>>>>>>> API
> >>>>>
> >>>>>> anyway, so sometimes it feels like forcing the algorithm logic
> >>>>>>>>>>>
> >>>>>>>>>> into
> >>>>>
> >>>>>> the
> >>>>>>>
> >>>>>>>> training API and tweaking it. Although we could not use the batch
> >>>>>>>>>>> primitives like join, we would have the E.g. in my experience
> >>>>>>>>>>> with
> >>>>>>>>>>> implementing a distributed matrix factorization algorithm [1],
> I
> >>>>>>>>>>>
> >>>>>>>>>> couldn't
> >>>>>>>
> >>>>>>>> do a simple optimization because of the limitations of the
> >>>>>>>>>>>
> >>>>>>>>>> iteration
> >>>>>
> >>>>>> API
> >>>>>>>
> >>>>>>>> [2]. Even if we pushed all the development effort to make the
> >>>>>>>>>>>
> >>>>>>>>>> batch
> >>>>>
> >>>>>> API
> >>>>>>>
> >>>>>>>> more suitable for ML there would be things we couldn't do. E.g.
> >>>>>>>>>>>
> >>>>>>>>>> there
> >>>>>>
> >>>>>>> are
> >>>>>>>
> >>>>>>>> approaches for updating a model iteratively without locks [3,4]
> >>>>>>>>>>>
> >>>>>>>>>> (i.e.
> >>>>>>
> >>>>>>> somewhat asynchronously), and I don't see a clear way to implement
> >>>>>>>>>>>
> >>>>>>>>>> such
> >>>>>>>
> >>>>>>>> algorithms with the batch API.
> >>>>>>>>>>>
> >>>>>>>>>>> 3) Streaming community (users and devs) benefit
> >>>>>>>>>>> The Flink streaming community in general would also benefit
> from
> >>>>>>>>>>>
> >>>>>>>>>> this
> >>>>>>
> >>>>>>> direction. There are many features needed in the streaming API for
> >>>>>>>>>>>
> >>>>>>>>>> ML
> >>>>>>
> >>>>>>> to
> >>>>>>>
> >>>>>>>> work, but this is also true for the batch API. One really
> >>>>>>>>>>>
> >>>>>>>>>> important
> >>>>>
> >>>>>> is
> >>>>>>
> >>>>>>> the
> >>>>>>>>>>> loops API (a.k.a. iterative DataStreams) [5]. There has been a
> >>>>>>>>>>> lot
> >>>>>>>>>>>
> >>>>>>>>>> of
> >>>>>>
> >>>>>>> effort (mostly from Paris) for making it mature enough [6]. Kate
> >>>>>>>>>>> mentioned
> >>>>>>>>>>> using GPUs, and I'm sure they have uses in streaming generally
> >>>>>>>>>>>
> >>>>>>>>>> [7].
> >>>>>
> >>>>>> Thus,
> >>>>>>>
> >>>>>>>> by improving the streaming API to allow ML algorithms, the
> >>>>>>>>>>>
> >>>>>>>>>> streaming
> >>>>>
> >>>>>> API
> >>>>>>>
> >>>>>>>> benefit too (which is important as they have a lot more production
> >>>>>>>>>>>
> >>>>>>>>>> users
> >>>>>>>
> >>>>>>>> than the batch API).
> >>>>>>>>>>>
> >>>>>>>>>>> 4) Performance can be at least as good
> >>>>>>>>>>> I believe the same performance could be achieved with the
> >>>>>>>>>>>
> >>>>>>>>>> streaming
> >>>>>
> >>>>>> API
> >>>>>>>
> >>>>>>>> as
> >>>>>>>>>>> with the batch API. Streaming API is much closer to the runtime
> >>>>>>>>>>>
> >>>>>>>>>> than
> >>>>>
> >>>>>> the
> >>>>>>>
> >>>>>>>> batch API. For corner-cases, with runtime-layer optimizations of
> >>>>>>>>>>>
> >>>>>>>>>> batch
> >>>>>>
> >>>>>>> API,
> >>>>>>>>>>> we could find a way to do the same (or similar) optimization
> for
> >>>>>>>>>>>
> >>>>>>>>>> the
> >>>>>
> >>>>>> streaming API (see my previous point). Such case could be using
> >>>>>>>>>>>
> >>>>>>>>>> managed
> >>>>>>>
> >>>>>>>> memory (and spilling to disk). There are also benefits by default,
> >>>>>>>>>>>
> >>>>>>>>>> e.g.
> >>>>>>>
> >>>>>>>> we
> >>>>>>>>>>> would have a finer grained fault tolerance with the streaming
> >>>>>>>>>>> API.
> >>>>>>>>>>>
> >>>>>>>>>>> 5) We could keep batch ML API
> >>>>>>>>>>> For the shorter term, we should not throw away all the
> algorithms
> >>>>>>>>>>> implemented with the batch API. By pushing forward the
> >>>>>>>>>>> development
> >>>>>>>>>>>
> >>>>>>>>>> with
> >>>>>>>
> >>>>>>>> side inputs we could make them usable with streaming API. Then, if
> >>>>>>>>>>>
> >>>>>>>>>> the
> >>>>>>
> >>>>>>> library gains some popularity, we could replace the algorithms in
> >>>>>>>>>>>
> >>>>>>>>>> the
> >>>>>>
> >>>>>>> batch
> >>>>>>>>>>> API with streaming ones, to avoid the performance costs of e.g.
> >>>>>>>>>>>
> >>>>>>>>>> not
> >>>>>
> >>>>>> being
> >>>>>>>
> >>>>>>>> able to persist.
> >>>>>>>>>>>
> >>>>>>>>>>> 6) General tools for implementing ML algorithms
> >>>>>>>>>>> Besides implementing algorithms one by one, we could give more
> >>>>>>>>>>>
> >>>>>>>>>> general
> >>>>>>
> >>>>>>> tools for making it easier to implement algorithms. E.g. parameter
> >>>>>>>>>>>
> >>>>>>>>>> server
> >>>>>>>
> >>>>>>>> [8,9]. Theo also mentioned in another thread that TensorFlow has a
> >>>>>>>>>>> similar
> >>>>>>>>>>> model to Flink streaming, we could look into that too. I think
> >>>>>>>>>>>
> >>>>>>>>>> often
> >>>>>
> >>>>>> when
> >>>>>>>
> >>>>>>>> deploying a production ML system, much more configuration and
> >>>>>>>>>>>
> >>>>>>>>>> tweaking
> >>>>>>
> >>>>>>> should be done than e.g. Spark MLlib allows. Why not allow that?
> >>>>>>>>>>>
> >>>>>>>>>>> 7) Showcasing
> >>>>>>>>>>> Showcasing this could be easier. We could say that we're doing
> >>>>>>>>>>>
> >>>>>>>>>> batch
> >>>>>
> >>>>>> ML
> >>>>>>>
> >>>>>>>> with a streaming API. That's interesting in its own. IMHO this
> >>>>>>>>>>> integration
> >>>>>>>>>>> is also a more approachable way towards end-to-end ML.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for reading so far :)
> >>>>>>>>>>>
> >>>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
> >>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
> >>>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.
> pd
> >>>>>>>>>>> f
> >>>>>>>>>>> [4] https://www.usenix.org/system/
> files/conference/hotos13/hotos
> >>>>>>>>>>> 13-final77.pdf
> >>>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
> >>>>>>>>>>> Scoped+Loops+and+Job+Termination
> >>>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
> >>>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-
> sigmod16.
> >>>>>>>>>>>
> >>>>>>>>>> pdf
> >>>>>
> >>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
> >>>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble
> .
> >>>>>>>>>>> com/Using-QueryableState-inside-Flink-jobs-and-
> >>>>>>>>>>> Parameter-Server-implementation-td15880.html
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Gabor
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>
> >>>>>> *Yours faithfully, *
> >>>>>>
> >>>>>> *Kate Eri.*
> >>>>>>
> >>>>>>
> >>
> >
>
>
> --
> Roberto Bentivoglio
> CTO
> e. roberto.bentivoglio@radicalbit.io
> Radicalbit S.r.l.
> radicalbit.io
>

Re: [DISCUSS] Flink ML roadmap

Posted by Roberto Bentivoglio <ro...@radicalbit.io>.
Hi All,

First of all I'd like to introduce myself: my name is Roberto Bentivoglio
and I'm currently working at Radicalbit, like Andrea Spina (who already wrote
in this thread).
I haven't had the chance to contribute directly to Flink so far, but some
colleagues of mine have been doing so for at least a year (they have also
contributed to the machine learning library).

I hope I'm not jumping into the discussion too late; it's really interesting,
and the analysis document depicts the currently available scenarios really
well. Many thanks for your effort!

If I can add my two cents to the discussion, I'd like to add the following:
 - it's clear that the Flink community is currently focused more on streaming
features than on batch features. For this reason I think that implementing
"Offline learning with Streaming API" is really a great idea.
 - I think that the "Online learning" option is really a good fit for Flink,
but maybe we could initially give a higher priority to the "Offline learning
with Streaming API" option. Still, I think the online option will be the main
goal for the mid/long term.
 - we implemented a library based on jpmml-evaluator [1] and Flink called
"flink-jpmml". Using this library you can train models on external systems
and, after exporting them in the PMML standard format, use them to run
evaluations on top of the DataStream API (a minimal scoring sketch follows
after this list). We haven't open sourced this library yet, but we're
planning to do so in the next few weeks. We'd like to complete the
documentation and the final code reviews before sharing it. I hope it will
help the community enhance the ML support in Flink.
 - I'd also like to mention that the Apache Beam community is thinking about
an ML DSL. There is a design document and a couple of Jira tasks for that
[2][3]
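
To give a flavour of what using such an exported model from a DataStream job
could look like, here is a minimal sketch in Scala. The PmmlModel wrapper and
its methods are hypothetical placeholders: the real loading and evaluation
calls come from jpmml-evaluator and depend on its version, so the point of
the sketch is only the Flink side (load the exported model once in open(),
score every record in map()).

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

// Hypothetical wrapper around a jpmml-evaluator model. The calls that
// unmarshal the PMML document, prepare the input fields and decode the
// result depend on the jpmml-evaluator version, so they are only hinted at.
class PmmlModel(path: String) extends Serializable {
  def loadEvaluator(): Unit = { /* read the PMML file at `path`, build an Evaluator */ }
  def score(features: Map[String, Double]): Double = { /* evaluate and decode */ 0.0 }
}

// Scores every event against a model that was trained and exported elsewhere.
class PmmlScorer(modelPath: String)
  extends RichMapFunction[Map[String, Double], Double] {

  private var model: PmmlModel = _

  override def open(parameters: Configuration): Unit = {
    model = new PmmlModel(modelPath)
    model.loadEvaluator() // load the exported model once per task, not per record
  }

  override def map(features: Map[String, Double]): Double = model.score(features)
}

object PmmlScoringJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val events = env.fromElements(
      Map("x1" -> 0.3, "x2" -> 1.7),
      Map("x1" -> 2.1, "x2" -> -0.4))
    events.map(new PmmlScorer("/path/to/model.pmml")).print()
    env.execute("PMML scoring sketch")
  }
}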

At Radicalbit we're really keen to focus our effort on improving the ML
support in Flink, and our team will contribute to this effort on a regular
basis.

Looking forward to working with you!

Many thanks,
Roberto

[1] - https://github.com/jpmml/jpmml-evaluator
[2] -
https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA
[3] - https://issues.apache.org/jira/browse/BEAM-303

On 28 February 2017 at 19:35, Gábor Hermann <ma...@gaborhermann.com> wrote:

> Hi Philipp,
>
> It's great to hear you are interested in Flink ML!
>
> Based on your description, your prototype seems like an interesting
> approach for combining online+offline learning. If you're interested, we
> might find a way to integrate your work, or at least your ideas, into Flink
> ML if we decide on a direction that fits your approach. I think your work
> could be relevant for almost all the directions listed there (if I
> understand correctly you'd even like to serve predictions on unlabeled
> data).
>
> Feel free to join the discussion in the docs you've mentioned :)
>
> Cheers,
> Gabor
>
>
> On 2017-02-27 18:39, Philipp Zehnder wrote:
>
> Hello all,
>>
>> I’m new to this mailing list and I wanted to introduce myself. My name is
>> Philipp Zehnder and I’m a master’s student in Computer Science at the
>> Karlsruhe Institute of Technology in Germany, currently writing my master’s
>> thesis, whose main goal is to integrate reusable machine learning
>> components into a stream processing network. One part of my thesis is to
>> create an API for distributed online machine learning.
>>
>> I saw that there are some recent discussions about how to continue the
>> development of Flink ML [1] and I want to share some of my experiences and
>> maybe get some feedback from the community on my ideas.
>>
>> As I am new to open source projects I hope this is the right place for
>> this.
>>
>> In the beginning, I had a look at different already existing frameworks
>> like Apache SAMOA for example, which is great and has a lot of useful
>> resources. However, as Flink is currently focusing on streaming, from my
>> point of view it makes sense to also have a streaming machine learning API
>> as part of the Flink ecosystem.
>>
>> I’m currently working on building a prototype for a distributed streaming
>> machine learning library based on Flink that can be used for online and
>> “classical” offline learning.
>>
>> The machine learning algorithm takes labeled and unlabeled data. On a
>> labeled data point a prediction is performed first, and then the label is
>> used to train the model. On an unlabeled data point only a prediction is
>> performed. The main difference between the online and offline algorithms is
>> that in the offline case the labeled data must be handed to the model
>> before the unlabeled data. In the online case, it is still possible to
>> process labeled data at a later point to update the model. The advantage of
>> this approach is that batch algorithms can be applied to streaming data and
>> online algorithms can be supported as well.
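
As an illustration of that flow, here is a rough Scala sketch of the "predict
on every element, update only on labeled elements" pattern using two connected
streams. The names (OnlineLearner, learningRate) are made up for the example,
the model is a plain local weight vector trained with SGD on the squared loss,
and the operator is meant to run with parallelism 1 so there is a single copy
of the model; a real implementation would have to address model partitioning
or sharing.

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// A labeled example: feature vector plus target value.
case class Labeled(features: Array[Double], label: Double)

class OnlineLearner(dim: Int, learningRate: Double)
  extends RichCoFlatMapFunction[Labeled, Array[Double], Double] {

  private var weights: Array[Double] = _

  override def open(parameters: Configuration): Unit = {
    weights = Array.fill(dim)(0.0)
  }

  private def predict(x: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < dim) { s += weights(i) * x(i); i += 1 }
    s
  }

  // Labeled stream: predict first, then take one SGD step.
  override def flatMap1(in: Labeled, out: Collector[Double]): Unit = {
    val p = predict(in.features)
    out.collect(p)
    val err = p - in.label
    var i = 0
    while (i < dim) { weights(i) -= learningRate * err * in.features(i); i += 1 }
  }

  // Unlabeled stream: prediction only.
  override def flatMap2(x: Array[Double], out: Collector[Double]): Unit =
    out.collect(predict(x))
}

// Usage sketch (parallelism 1 keeps a single copy of the model):
//   labeled.connect(unlabeled)
//     .flatMap(new OnlineLearner(dim = 2, learningRate = 0.01))
//     .setParallelism(1)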
>>
>> One difference from batch learning is the transformers that are used to
>> preprocess the data. For example, a simple mean subtraction must be
>> implemented with a rolling mean, because we can’t calculate the mean over
>> all the data, but the Flink Streaming API is perfect for that. It would be
>> useful for users to have an extensible toolbox of transformers.
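
For instance, a rolling mean subtraction on a keyed stream could be sketched
roughly like this; the Point case class and the keying are assumptions made
only for the example:

import org.apache.flink.streaming.api.scala._

// The running (count, mean) is kept as keyed state and the centred value is
// emitted for every element.
case class Point(key: String, value: Double)

object RollingMeanCentering {
  def center(points: DataStream[Point]): DataStream[Point] =
    points
      .keyBy(_.key)
      .mapWithState[Point, (Long, Double)] { (p, state) =>
        val (count, mean) = state.getOrElse((0L, 0.0))
        val newCount = count + 1
        // incremental mean: m_{n+1} = m_n + (x - m_n) / (n + 1)
        val newMean = mean + (p.value - mean) / newCount
        (Point(p.key, p.value - newMean), Some((newCount, newMean)))
      }
}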
>>
>> Another difference is the evaluation of the models. In streaming scenarios
>> we don’t have a single value that determines the model quality; instead,
>> this value evolves over time as the model sees more labeled data.
>>
>> However, transformation and evaluation again work similarly in both
>> online learning and offline learning.
>>
>> I also liked the discussion in [2] and I think that the competition in
>> the batch learning field is hard and there are already a lot of great
>> projects. I think it is true that in most real world problems it is not
>> necessary to update the model immediately, but there are a lot of use cases
>> for machine learning on streams. For them it would be nice to have a native
>> approach.
>>
>> A stream machine learning API for Flink would fit very well and I would
>> also be willing to contribute to the future development of the Flink ML
>> library.
>>
>>
>>
>> Best regards,
>>
>> Philipp
>>
>> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
>> com/DISCUSS-Flink-ML-roadmap-td16040.html <http://apache-flink-mailing-l
>> ist-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-td16040.html>
>> [2] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
>> 49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2 <
>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQ
>> c49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2>
>>
>>
>> Am 23.02.2017 um 15:48 schrieb Gábor Hermann <ma...@gaborhermann.com>:
>>>
>>> Okay, I've created a skeleton of the design doc for choosing a direction:
>>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
>>> 49h3Ud06MIRhahtJ6dw/edit?usp=sharing
>>>
>>> Much of the pros/cons have already been discussed here, so I'll try to
>>> put there all the arguments mentioned in this thread. Feel free to put
>>> there more :)
>>>
>>> @Stavros: I agree we should take action fast. What about collecting our
>>> thoughts in the doc by around Tuesday next week (28. February)? Then decide
>>> on the direction and design a roadmap by around Friday (3. March)? Is that
>>> feasible, or should it take more time?
>>>
>>> I think it will be necessary to have a shepherd, or even better a
>>> committer, to be involved in at least reviewing and accepting the roadmap.
>>> It would be best, if a committer coordinated all this.
>>> @Theodore: Would you like to do the coordination?
>>>
>>> Regarding the use-cases: I've seen some abstracts of talks at SF Flink
>>> Forward [1] that seem promising. There are companies already using Flink
>>> for ML [2,3,4,5].
>>>
>>> [1] http://sf.flink-forward.org/program/sessions/
>>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-str
>>> eaming-vs-micro-batch-for-online-learning/
>>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-te
>>> nsorflow/
>>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-le
>>> arning-on-flink/
>>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learn
>>> ing-scenarios-with-flink/
>>>
>>> Cheers,
>>> Gabor
>>>
>>>
>>> On 2017-02-23 15:19, Katherin Eri wrote:
>>>
>>>> I have asked already some teams for useful cases, but all of them need
>>>> time
>>>> to think.
>>>> During analysis something will finally arise.
>>>> May be we can ask partners of Flink  for cases? Data Artisans got
>>>> results
>>>> of customers survey: [1], ML better support is wanted, so we could ask
>>>> what
>>>> exactly is necessary.
>>>>
>>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
>>>>
>>>> On 23 Feb 2017 at 4:32 PM, "Stavros Kontopoulos" <
>>>> st.kontopoulos@gmail.com> wrote:
>>>>
>>>> +100 for a design doc.
>>>>>
>>>>> Could we also set a roadmap after some time-boxed investigation
>>>>> captured in
>>>>> that document? We need action.
>>>>>
>>>>> Looking forward to work on this (whatever that might be) ;) Also are
>>>>> there
>>>>> any data supporting one direction or the other from a customer
>>>>> perspective?
>>>>> It would help to make more informed decisions.
>>>>>
>>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <ka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Yes, ok.
>>>>>> let's start some design document, and write down there already
>>>>>> mentioned
>>>>>> ideas about: parameter server, about clipper and others. Would be
>>>>>> nice if
>>>>>> we will also map this approaches to cases.
>>>>>> Will work on it collaboratively on each topic, may be finally we will
>>>>>>
>>>>> form
>>>>>
>>>>>> some picture, that could be agreed with committers.
>>>>>> @Gabor, could you please start such shared doc, as you have already
>>>>>>
>>>>> several
>>>>>
>>>>>> ideas proposed?
>>>>>>
>>>>>> Thu, 23 Feb 2017, 15:06 Gábor Hermann <ma...@gaborhermann.com>:
>>>>>>
>>>>>> I agree, that it's better to go in one direction first, but I think
>>>>>>> online and offline with streaming API can go somewhat parallel later.
>>>>>>>
>>>>>> We
>>>>>
>>>>>> could set a short-term goal, concentrate initially on one direction,
>>>>>>>
>>>>>> and
>>>>>
>>>>>> showcase that direction (e.g. in a blogpost). But first, we should
>>>>>>> list
>>>>>>> the pros/cons in a design doc as a minimum. Then make a decision what
>>>>>>> direction to go. Would that be feasible?
>>>>>>>
>>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
>>>>>>>
>>>>>>> I'm not sure that this is feasible, doing all at the same time could
>>>>>>>>
>>>>>>> mean
>>>>>>
>>>>>>> doing nothing((((
>>>>>>>> I'm just afraid, that words: we will work on streaming not on
>>>>>>>>
>>>>>>> batching,
>>>>>
>>>>>> we
>>>>>>>
>>>>>>>> have no commiter's time for this, mean that yes, we started work on
>>>>>>>> FLINK-1730, but nobody will commit this work in the end, as it
>>>>>>>>
>>>>>>> already
>>>>>
>>>>>> was
>>>>>>>
>>>>>>>> with this ticket.
>>>>>>>>
>>>>>>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <
>>>>>>>>
>>>>>>> mail@gaborhermann.com>
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> @Theodore: Great to hear you think the "batch on streaming" approach
>>>>>>>>>
>>>>>>>> is
>>>>>>
>>>>>>> possible! Of course, we need to pay attention all the pitfalls
>>>>>>>>>
>>>>>>>> there,
>>>>>
>>>>>> if we
>>>>>>>
>>>>>>>> go that way.
>>>>>>>>>
>>>>>>>>> +1 for a design doc!
>>>>>>>>>
>>>>>>>>> I would add that it's possible to make efforts in all the three
>>>>>>>>>
>>>>>>>> directions
>>>>>>>
>>>>>>>> (i.e. batch, online, batch on streaming) at the same time. Although,
>>>>>>>>>
>>>>>>>> it
>>>>>>
>>>>>>> might be worth to concentrate on one. E.g. it would not be so useful
>>>>>>>>>
>>>>>>>> to
>>>>>>
>>>>>>> have the same batch algorithms with both the batch API and streaming
>>>>>>>>>
>>>>>>>> API.
>>>>>>>
>>>>>>>> We can decide later.
>>>>>>>>>
>>>>>>>>> The design doc could be partitioned to these 3 directions, and we
>>>>>>>>>
>>>>>>>> can
>>>>>
>>>>>> collect there the pros/cons too. What do you think?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Gabor
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>>>>>>
>>>>>>>>> Hello all,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> @Gabor, we have discussed the idea of using the streaming API to
>>>>>>>>>>
>>>>>>>>> write
>>>>>>
>>>>>>> all
>>>>>>>
>>>>>>>> of our ML algorithms with a couple of people offline,
>>>>>>>>>> and I think it might be possible and is generally worth a shot.
>>>>>>>>>> The
>>>>>>>>>> approach we would take would be close to Vowpal Wabbit, not
>>>>>>>>>> exactly
>>>>>>>>>> "online", but rather "fast-batch".
>>>>>>>>>>
>>>>>>>>>> There will be problems popping up again, even for very simple
>>>>>>>>>> algos
>>>>>>>>>>
>>>>>>>>> like
>>>>>>>
>>>>>>>> on
>>>>>>>>>> line linear regression with SGD [1], but hopefully fixing those
>>>>>>>>>>
>>>>>>>>> will
>>>>>
>>>>>> be
>>>>>>
>>>>>>> more aligned with the priorities of the community.
>>>>>>>>>>
>>>>>>>>>> @Katherin, my understanding is that given the limited resources,
>>>>>>>>>>
>>>>>>>>> there
>>>>>>
>>>>>>> is
>>>>>>>
>>>>>>>> no development effort focused on batch processing right now.
>>>>>>>>>>
>>>>>>>>>> So to summarize, it seems like there are people willing to work on
>>>>>>>>>>
>>>>>>>>> ML
>>>>>
>>>>>> on
>>>>>>>
>>>>>>>> Flink, but nobody is sure how to do it.
>>>>>>>>>> There are many directions we could take (batch, online, batch on
>>>>>>>>>> streaming), each with its own merits and downsides.
>>>>>>>>>>
>>>>>>>>>> If you want we can start a design doc and move the conversation
>>>>>>>>>>
>>>>>>>>> there,
>>>>>>
>>>>>>> come
>>>>>>>>>> up with a roadmap and start implementing.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Theodore
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.
>>>>>>>>>> nabble.com/Understanding-connected-streams-use-without-times
>>>>>>>>>> tamps-td10241.html
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
>>>>>>>>>>
>>>>>>>>> mail@gaborhermann.com
>>>>>>
>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> It's great to see so much activity in this discussion :)
>>>>>>>>>>
>>>>>>>>>>> I'll try to add my thoughts.
>>>>>>>>>>>
>>>>>>>>>>> I think building a developer community (Till's 2. point) can be
>>>>>>>>>>>
>>>>>>>>>> slightly
>>>>>>>
>>>>>>>> separated from what features we should aim for (1. point) and
>>>>>>>>>>>
>>>>>>>>>> showcasing
>>>>>>>
>>>>>>>> (3. point). Thanks Till for bringing up the ideas for
>>>>>>>>>>>
>>>>>>>>>> restructuring,
>>>>>
>>>>>> I'm
>>>>>>>
>>>>>>>> sure we'll find a way to make the development process more
>>>>>>>>>>>
>>>>>>>>>> dynamic.
>>>>>
>>>>>> I'll
>>>>>>>
>>>>>>>> try to address the rest here.
>>>>>>>>>>>
>>>>>>>>>>> It's hard to choose directions between streaming and batch ML. As
>>>>>>>>>>>
>>>>>>>>>> Theo
>>>>>>
>>>>>>> has
>>>>>>>>>>> indicated, not much online ML is used in production, but Flink
>>>>>>>>>>> concentrates
>>>>>>>>>>> on streaming, so online ML would be a better fit for Flink.
>>>>>>>>>>>
>>>>>>>>>> However,
>>>>>
>>>>>> as
>>>>>>>
>>>>>>>> most of you argued, there's definite need for batch ML. But batch
>>>>>>>>>>>
>>>>>>>>>> ML
>>>>>
>>>>>> seems
>>>>>>>>>>> hard to achieve because there are blocking issues with
>>>>>>>>>>> persisting,
>>>>>>>>>>> iteration paths etc. So it's no good either way.
>>>>>>>>>>>
>>>>>>>>>>> I propose a seemingly crazy solution: what if we developed batch
>>>>>>>>>>> algorithms also with the streaming API? The batch API would
>>>>>>>>>>>
>>>>>>>>>> clearly
>>>>>
>>>>>> seem
>>>>>>>
>>>>>>>> more suitable for ML algorithms, but there a lot of benefits of
>>>>>>>>>>>
>>>>>>>>>> this
>>>>>
>>>>>> approach too, so it's clearly worth considering. Flink also has
>>>>>>>>>>>
>>>>>>>>>> the
>>>>>
>>>>>> high
>>>>>>>
>>>>>>>> level vision of "streaming for everything" that would clearly fit
>>>>>>>>>>>
>>>>>>>>>> this
>>>>>>
>>>>>>> case. What do you all think about this? Do you think this solution
>>>>>>>>>>>
>>>>>>>>>> would
>>>>>>>
>>>>>>>> be
>>>>>>>>>>> feasible? I would be happy to make a more elaborate proposal, but
>>>>>>>>>>>
>>>>>>>>>> I
>>>>>
>>>>>> push
>>>>>>>
>>>>>>>> my
>>>>>>>>>>> main ideas here:
>>>>>>>>>>>
>>>>>>>>>>> 1) Simplifying by using one system
>>>>>>>>>>> It could simplify the work of both the users and the developers.
>>>>>>>>>>>
>>>>>>>>>> One
>>>>>
>>>>>> could
>>>>>>>>>>> execute training once, or could execute it periodically e.g. by
>>>>>>>>>>>
>>>>>>>>>> using
>>>>>>
>>>>>>> windows. Low-latency serving and training could be done in the
>>>>>>>>>>>
>>>>>>>>>> same
>>>>>
>>>>>> system.
>>>>>>>>>>> We could implement incremental algorithms, without any side
>>>>>>>>>>> inputs
>>>>>>>>>>>
>>>>>>>>>> for
>>>>>>
>>>>>>> combining online learning (or predictions) with batch learning. Of
>>>>>>>>>>> course,
>>>>>>>>>>> all the logic describing these must be somehow implemented (e.g.
>>>>>>>>>>> synchronizing predictions with training), but it should be easier
>>>>>>>>>>>
>>>>>>>>>> to
>>>>>
>>>>>> do
>>>>>>>
>>>>>>>> so
>>>>>>>>>>> in one system, than by combining e.g. the batch and streaming
>>>>>>>>>>> API.
>>>>>>>>>>>
>>>>>>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>>>>>>> Despite these benefits, it could seem harder to implement batch
>>>>>>>>>>> ML
>>>>>>>>>>>
>>>>>>>>>> with
>>>>>>>
>>>>>>>> the streaming API, but in my opinion it's not. There are more
>>>>>>>>>>>
>>>>>>>>>> flexible,
>>>>>>>
>>>>>>>> lower-level optimization potentials with the streaming API. Most
>>>>>>>>>>> distributed ML algorithms use a lower-level model than the batch
>>>>>>>>>>>
>>>>>>>>>> API
>>>>>
>>>>>> anyway, so sometimes it feels like forcing the algorithm logic
>>>>>>>>>>>
>>>>>>>>>> into
>>>>>
>>>>>> the
>>>>>>>
>>>>>>>> training API and tweaking it. Although we could not use the batch
>>>>>>>>>>> primitives like join, we would have the E.g. in my experience
>>>>>>>>>>> with
>>>>>>>>>>> implementing a distributed matrix factorization algorithm [1], I
>>>>>>>>>>>
>>>>>>>>>> couldn't
>>>>>>>
>>>>>>>> do a simple optimization because of the limitations of the
>>>>>>>>>>>
>>>>>>>>>> iteration
>>>>>
>>>>>> API
>>>>>>>
>>>>>>>> [2]. Even if we pushed all the development effort to make the
>>>>>>>>>>>
>>>>>>>>>> batch
>>>>>
>>>>>> API
>>>>>>>
>>>>>>>> more suitable for ML there would be things we couldn't do. E.g.
>>>>>>>>>>>
>>>>>>>>>> there
>>>>>>
>>>>>>> are
>>>>>>>
>>>>>>>> approaches for updating a model iteratively without locks [3,4]
>>>>>>>>>>>
>>>>>>>>>> (i.e.
>>>>>>
>>>>>>> somewhat asynchronously), and I don't see a clear way to implement
>>>>>>>>>>>
>>>>>>>>>> such
>>>>>>>
>>>>>>>> algorithms with the batch API.
>>>>>>>>>>>
>>>>>>>>>>> 3) Streaming community (users and devs) benefit
>>>>>>>>>>> The Flink streaming community in general would also benefit from
>>>>>>>>>>>
>>>>>>>>>> this
>>>>>>
>>>>>>> direction. There are many features needed in the streaming API for
>>>>>>>>>>>
>>>>>>>>>> ML
>>>>>>
>>>>>>> to
>>>>>>>
>>>>>>>> work, but this is also true for the batch API. One really
>>>>>>>>>>>
>>>>>>>>>> important
>>>>>
>>>>>> is
>>>>>>
>>>>>>> the
>>>>>>>>>>> loops API (a.k.a. iterative DataStreams) [5]. There has been a
>>>>>>>>>>> lot
>>>>>>>>>>>
>>>>>>>>>> of
>>>>>>
>>>>>>> effort (mostly from Paris) for making it mature enough [6]. Kate
>>>>>>>>>>> mentioned
>>>>>>>>>>> using GPUs, and I'm sure they have uses in streaming generally
>>>>>>>>>>>
>>>>>>>>>> [7].
>>>>>
>>>>>> Thus,
>>>>>>>
>>>>>>>> by improving the streaming API to allow ML algorithms, the
>>>>>>>>>>>
>>>>>>>>>> streaming
>>>>>
>>>>>> API
>>>>>>>
>>>>>>>> benefit too (which is important as they have a lot more production
>>>>>>>>>>>
>>>>>>>>>> users
>>>>>>>
>>>>>>>> than the batch API).
>>>>>>>>>>>
>>>>>>>>>>> 4) Performance can be at least as good
>>>>>>>>>>> I believe the same performance could be achieved with the
>>>>>>>>>>>
>>>>>>>>>> streaming
>>>>>
>>>>>> API
>>>>>>>
>>>>>>>> as
>>>>>>>>>>> with the batch API. Streaming API is much closer to the runtime
>>>>>>>>>>>
>>>>>>>>>> than
>>>>>
>>>>>> the
>>>>>>>
>>>>>>>> batch API. For corner-cases, with runtime-layer optimizations of
>>>>>>>>>>>
>>>>>>>>>> batch
>>>>>>
>>>>>>> API,
>>>>>>>>>>> we could find a way to do the same (or similar) optimization for
>>>>>>>>>>>
>>>>>>>>>> the
>>>>>
>>>>>> streaming API (see my previous point). Such case could be using
>>>>>>>>>>>
>>>>>>>>>> managed
>>>>>>>
>>>>>>>> memory (and spilling to disk). There are also benefits by default,
>>>>>>>>>>>
>>>>>>>>>> e.g.
>>>>>>>
>>>>>>>> we
>>>>>>>>>>> would have a finer grained fault tolerance with the streaming
>>>>>>>>>>> API.
>>>>>>>>>>>
>>>>>>>>>>> 5) We could keep batch ML API
>>>>>>>>>>> For the shorter term, we should not throw away all the algorithms
>>>>>>>>>>> implemented with the batch API. By pushing forward the
>>>>>>>>>>> development
>>>>>>>>>>>
>>>>>>>>>> with
>>>>>>>
>>>>>>>> side inputs we could make them usable with streaming API. Then, if
>>>>>>>>>>>
>>>>>>>>>> the
>>>>>>
>>>>>>> library gains some popularity, we could replace the algorithms in
>>>>>>>>>>>
>>>>>>>>>> the
>>>>>>
>>>>>>> batch
>>>>>>>>>>> API with streaming ones, to avoid the performance costs of e.g.
>>>>>>>>>>>
>>>>>>>>>> not
>>>>>
>>>>>> being
>>>>>>>
>>>>>>>> able to persist.
>>>>>>>>>>>
>>>>>>>>>>> 6) General tools for implementing ML algorithms
>>>>>>>>>>> Besides implementing algorithms one by one, we could give more
>>>>>>>>>>>
>>>>>>>>>> general
>>>>>>
>>>>>>> tools for making it easier to implement algorithms. E.g. parameter
>>>>>>>>>>>
>>>>>>>>>> server
>>>>>>>
>>>>>>>> [8,9]. Theo also mentioned in another thread that TensorFlow has a
>>>>>>>>>>> similar
>>>>>>>>>>> model to Flink streaming, we could look into that too. I think
>>>>>>>>>>>
>>>>>>>>>> often
>>>>>
>>>>>> when
>>>>>>>
>>>>>>>> deploying a production ML system, much more configuration and
>>>>>>>>>>>
>>>>>>>>>> tweaking
>>>>>>
>>>>>>> should be done than e.g. Spark MLlib allows. Why not allow that?
>>>>>>>>>>>
>>>>>>>>>>> 7) Showcasing
>>>>>>>>>>> Showcasing this could be easier. We could say that we're doing
>>>>>>>>>>>
>>>>>>>>>> batch
>>>>>
>>>>>> ML
>>>>>>>
>>>>>>>> with a streaming API. That's interesting in its own. IMHO this
>>>>>>>>>>> integration
>>>>>>>>>>> is also a more approachable way towards end-to-end ML.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks for reading so far :)
>>>>>>>>>>>
>>>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pd
>>>>>>>>>>> f
>>>>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
>>>>>>>>>>> 13-final77.pdf
>>>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
>>>>>>>>>>> Scoped+Loops+and+Job+Termination
>>>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.
>>>>>>>>>>>
>>>>>>>>>> pdf
>>>>>
>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
>>>>>>>>>>> com/Using-QueryableState-inside-Flink-jobs-and-
>>>>>>>>>>> Parameter-Server-implementation-td15880.html
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Gabor
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>
>>>>>> *Yours faithfully, *
>>>>>>
>>>>>> *Kate Eri.*
>>>>>>
>>>>>>
>>
>


-- 
Roberto Bentivoglio
CTO
e. roberto.bentivoglio@radicalbit.io
Radicalbit S.r.l.
radicalbit.io

Re: [DISCUSS] Flink ML roadmap

Posted by Gábor Hermann <ma...@gaborhermann.com>.
Hi Philipp,

It's great to hear you are interested in Flink ML!

Based on your description, your prototype seems like an interesting 
approach for combining online+offline learning. If you're interested, we 
might find a way to integrate your work, or at least your ideas, into 
Flink ML if we decide on a direction that fits your approach. I think 
your work could be relevant for almost all the directions listed there 
(if I understand correctly you'd even like to serve predictions on 
unlabeled data).

Feel free to join the discussion in the docs you've mentioned :)

Cheers,
Gabor

On 2017-02-27 18:39, Philipp Zehnder wrote:

> Hello all,
>
> I’m new to this mailing list and I wanted to introduce myself. My name is Philipp Zehnder and I’m a Masters Student in Computer Science at the Karlsruhe Institute of Technology in Germany currently writing on my master’s thesis with the main goal to integrate reusable machine learning components into a stream processing network. One part of my thesis is to create an API for distributed online machine learning.
>
> I saw that there are some recent discussions how to continue the development of Flink ML [1] and I want to share some of my experiences and maybe get some feedback from the community for my ideas.
>
> As I am new to open source projects I hope this is the right place for this.
>
> In the beginning, I had a look at different already existing frameworks like Apache SAMOA for example, which is great and has a lot of useful resources. However, as Flink is currently focusing on streaming, from my point of view it makes sense to also have a streaming machine learning API as part of the Flink ecosystem.
>
> I’m currently working on building a prototype for a distributed streaming machine learning library based on Flink that can be used for online and “classical” offline learning.
>
> The machine learning algorithm takes labeled and non-labeled data. On a labeled data point first a prediction is performed and then this label is used to train the model. On a non-labeled data point just a prediction is performed. The main difference between the online and offline algorithms is that in the offline case the labeled data must be handed to the model before the unlabeled data. In the online case, it is still possible to process labeled data at a later point to update the model. The advantage of this approach is that batch algorithms can be applied on streaming data as well as online algorithms can be supported.
>
> One difference to batch learning are the transformers that are used to preprocess the data. For example, a simple mean subtraction must be implemented with a rolling mean, because we can’t calculate the mean over all the data, but the Flink Streaming API is perfect for that. It would be useful for users to have an extensible toolbox of transformers.
>
> Another difference is the evaluation of the models. As we don’t have a single value to determine the model quality, in streaming scenarios this value evolves over time when it sees more labeled data.
>
> However, the transformation and evaluation works again similar in both online learning and offline learning.
>
> I also liked the discussion in [2] and I think that the competition in the batch learning field is hard and there are already a lot of great projects. I think it is true that in most real world problems it is not necessary to update the model immediately, but there are a lot of use cases for machine learning on streams. For them it would be nice to have a native approach.
>
> A stream machine learning API for Flink would fit very well and I would also be willing to contribute to the future development of the Flink ML library.
>
>
>
> Best regards,
>
> Philipp
>
> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-td16040.html <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-td16040.html>
> [2] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2 <https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2>
>
>
>> On 23.02.2017 at 15:48, Gábor Hermann <ma...@gaborhermann.com> wrote:
>>
>> Okay, I've created a skeleton of the design doc for choosing a direction:
>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing
>>
>> Much of the pros/cons have already been discussed here, so I'll try to put there all the arguments mentioned in this thread. Feel free to put there more :)
>>
>> @Stavros: I agree we should take action fast. What about collecting our thoughts in the doc by around Tuesday next week (28. February)? Then decide on the direction and design a roadmap by around Friday (3. March)? Is that feasible, or should it take more time?
>>
>> I think it will be necessary to have a shepherd, or even better a committer, to be involved in at least reviewing and accepting the roadmap. It would be best, if a committer coordinated all this.
>> @Theodore: Would you like to do the coordination?
>>
>> Regarding the use-cases: I've seen some abstracts of talks at SF Flink Forward [1] that seem promising. There are companies already using Flink for ML [2,3,4,5].
>>
>> [1] http://sf.flink-forward.org/program/sessions/
>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-learning-on-flink/
>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learning-scenarios-with-flink/
>>
>> Cheers,
>> Gabor
>>
>>
>> On 2017-02-23 15:19, Katherin Eri wrote:
>>> I have asked already some teams for useful cases, but all of them need time
>>> to think.
>>> During analysis something will finally arise.
>>> May be we can ask partners of Flink  for cases? Data Artisans got results
>>> of customers survey: [1], ML better support is wanted, so we could ask what
>>> exactly is necessary.
>>>
>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
>>>
>>> On 23 Feb 2017 at 4:32 PM, "Stavros Kontopoulos" <
>>> st.kontopoulos@gmail.com> wrote:
>>>
>>>> +100 for a design doc.
>>>>
>>>> Could we also set a roadmap after some time-boxed investigation captured in
>>>> that document? We need action.
>>>>
>>>> Looking forward to work on this (whatever that might be) ;) Also are there
>>>> any data supporting one direction or the other from a customer perspective?
>>>> It would help to make more informed decisions.
>>>>
>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <ka...@gmail.com>
>>>> wrote:
>>>>
>>>>> Yes, ok.
>>>>> let's start some design document, and write down there already mentioned
>>>>> ideas about: parameter server, about clipper and others. Would be nice if
>>>>> we will also map this approaches to cases.
>>>>> Will work on it collaboratively on each topic, may be finally we will
>>>> form
>>>>> some picture, that could be agreed with committers.
>>>>> @Gabor, could you please start such shared doc, as you have already
>>>> several
>>>>> ideas proposed?
>>>>>
>>>>> Thu, 23 Feb 2017, 15:06 Gábor Hermann <ma...@gaborhermann.com>:
>>>>>
>>>>>> I agree, that it's better to go in one direction first, but I think
>>>>>> online and offline with streaming API can go somewhat parallel later.
>>>> We
>>>>>> could set a short-term goal, concentrate initially on one direction,
>>>> and
>>>>>> showcase that direction (e.g. in a blogpost). But first, we should list
>>>>>> the pros/cons in a design doc as a minimum. Then make a decision what
>>>>>> direction to go. Would that be feasible?
>>>>>>
>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
>>>>>>
>>>>>>> I'm not sure that this is feasible, doing all at the same time could
>>>>> mean
>>>>>>> doing nothing((((
>>>>>>> I'm just afraid, that words: we will work on streaming not on
>>>> batching,
>>>>>> we
>>>>>>> have no commiter's time for this, mean that yes, we started work on
>>>>>>> FLINK-1730, but nobody will commit this work in the end, as it
>>>> already
>>>>>> was
>>>>>>> with this ticket.
>>>>>>>
>>>>>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <
>>>>>> mail@gaborhermann.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> @Theodore: Great to hear you think the "batch on streaming" approach
>>>>> is
>>>>>>>> possible! Of course, we need to pay attention all the pitfalls
>>>> there,
>>>>>> if we
>>>>>>>> go that way.
>>>>>>>>
>>>>>>>> +1 for a design doc!
>>>>>>>>
>>>>>>>> I would add that it's possible to make efforts in all the three
>>>>>> directions
>>>>>>>> (i.e. batch, online, batch on streaming) at the same time. Although,
>>>>> it
>>>>>>>> might be worth to concentrate on one. E.g. it would not be so useful
>>>>> to
>>>>>>>> have the same batch algorithms with both the batch API and streaming
>>>>>> API.
>>>>>>>> We can decide later.
>>>>>>>>
>>>>>>>> The design doc could be partitioned to these 3 directions, and we
>>>> can
>>>>>>>> collect there the pros/cons too. What do you think?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Gabor
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>>>>>
>>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> @Gabor, we have discussed the idea of using the streaming API to
>>>>> write
>>>>>> all
>>>>>>>>> of our ML algorithms with a couple of people offline,
>>>>>>>>> and I think it might be possible and is generally worth a shot. The
>>>>>>>>> approach we would take would be close to Vowpal Wabbit, not exactly
>>>>>>>>> "online", but rather "fast-batch".
>>>>>>>>>
>>>>>>>>> There will be problems popping up again, even for very simple algos
>>>>>> like
>>>>>>>>> on
>>>>>>>>> line linear regression with SGD [1], but hopefully fixing those
>>>> will
>>>>> be
>>>>>>>>> more aligned with the priorities of the community.
>>>>>>>>>
>>>>>>>>> @Katherin, my understanding is that given the limited resources,
>>>>> there
>>>>>> is
>>>>>>>>> no development effort focused on batch processing right now.
>>>>>>>>>
>>>>>>>>> So to summarize, it seems like there are people willing to work on
>>>> ML
>>>>>> on
>>>>>>>>> Flink, but nobody is sure how to do it.
>>>>>>>>> There are many directions we could take (batch, online, batch on
>>>>>>>>> streaming), each with its own merits and downsides.
>>>>>>>>>
>>>>>>>>> If you want we can start a design doc and move the conversation
>>>>> there,
>>>>>>>>> come
>>>>>>>>> up with a roadmap and start implementing.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Theodore
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.
>>>>>>>>> nabble.com/Understanding-connected-streams-use-without-times
>>>>>>>>> tamps-td10241.html
>>>>>>>>>
>>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
>>>>> mail@gaborhermann.com
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> It's great to see so much activity in this discussion :)
>>>>>>>>>> I'll try to add my thoughts.
>>>>>>>>>>
>>>>>>>>>> I think building a developer community (Till's 2. point) can be
>>>>>> slightly
>>>>>>>>>> separated from what features we should aim for (1. point) and
>>>>>> showcasing
>>>>>>>>>> (3. point). Thanks Till for bringing up the ideas for
>>>> restructuring,
>>>>>> I'm
>>>>>>>>>> sure we'll find a way to make the development process more
>>>> dynamic.
>>>>>> I'll
>>>>>>>>>> try to address the rest here.
>>>>>>>>>>
>>>>>>>>>> It's hard to choose directions between streaming and batch ML. As
>>>>> Theo
>>>>>>>>>> has
>>>>>>>>>> indicated, not much online ML is used in production, but Flink
>>>>>>>>>> concentrates
>>>>>>>>>> on streaming, so online ML would be a better fit for Flink.
>>>> However,
>>>>>> as
>>>>>>>>>> most of you argued, there's definite need for batch ML. But batch
>>>> ML
>>>>>>>>>> seems
>>>>>>>>>> hard to achieve because there are blocking issues with persisting,
>>>>>>>>>> iteration paths etc. So it's no good either way.
>>>>>>>>>>
>>>>>>>>>> I propose a seemingly crazy solution: what if we developed batch
>>>>>>>>>> algorithms also with the streaming API? The batch API would
>>>> clearly
>>>>>> seem
>>>>>>>>>> more suitable for ML algorithms, but there a lot of benefits of
>>>> this
>>>>>>>>>> approach too, so it's clearly worth considering. Flink also has
>>>> the
>>>>>> high
>>>>>>>>>> level vision of "streaming for everything" that would clearly fit
>>>>> this
>>>>>>>>>> case. What do you all think about this? Do you think this solution
>>>>>> would
>>>>>>>>>> be
>>>>>>>>>> feasible? I would be happy to make a more elaborate proposal, but
>>>> I
>>>>>> push
>>>>>>>>>> my
>>>>>>>>>> main ideas here:
>>>>>>>>>>
>>>>>>>>>> 1) Simplifying by using one system
>>>>>>>>>> It could simplify the work of both the users and the developers.
>>>> One
>>>>>>>>>> could
>>>>>>>>>> execute training once, or could execute it periodically e.g. by
>>>>> using
>>>>>>>>>> windows. Low-latency serving and training could be done in the
>>>> same
>>>>>>>>>> system.
>>>>>>>>>> We could implement incremental algorithms, without any side inputs
>>>>> for
>>>>>>>>>> combining online learning (or predictions) with batch learning. Of
>>>>>>>>>> course,
>>>>>>>>>> all the logic describing these must be somehow implemented (e.g.
>>>>>>>>>> synchronizing predictions with training), but it should be easier
>>>> to
>>>>>> do
>>>>>>>>>> so
>>>>>>>>>> in one system, than by combining e.g. the batch and streaming API.
>>>>>>>>>>
>>>>>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>>>>>> Despite these benefits, it could seem harder to implement batch ML
>>>>>> with
>>>>>>>>>> the streaming API, but in my opinion it's not. There are more
>>>>>> flexible,
>>>>>>>>>> lower-level optimization potentials with the streaming API. Most
>>>>>>>>>> distributed ML algorithms use a lower-level model than the batch
>>>> API
>>>>>>>>>> anyway, so sometimes it feels like forcing the algorithm logic
>>>> into
>>>>>> the
>>>>>>>>>> training API and tweaking it. Although we could not use the batch
>>>>>>>>>> primitives like join, we would have the E.g. in my experience with
>>>>>>>>>> implementing a distributed matrix factorization algorithm [1], I
>>>>>> couldn't
>>>>>>>>>> do a simple optimization because of the limitations of the
>>>> iteration
>>>>>> API
>>>>>>>>>> [2]. Even if we pushed all the development effort to make the
>>>> batch
>>>>>> API
>>>>>>>>>> more suitable for ML there would be things we couldn't do. E.g.
>>>>> there
>>>>>> are
>>>>>>>>>> approaches for updating a model iteratively without locks [3,4]
>>>>> (i.e.
>>>>>>>>>> somewhat asynchronously), and I don't see a clear way to implement
>>>>>> such
>>>>>>>>>> algorithms with the batch API.
>>>>>>>>>>
>>>>>>>>>> 3) Streaming community (users and devs) benefit
>>>>>>>>>> The Flink streaming community in general would also benefit from
>>>>> this
>>>>>>>>>> direction. There are many features needed in the streaming API for
>>>>> ML
>>>>>> to
>>>>>>>>>> work, but this is also true for the batch API. One really
>>>> important
>>>>> is
>>>>>>>>>> the
>>>>>>>>>> loops API (a.k.a. iterative DataStreams) [5]. There has been a lot
>>>>> of
>>>>>>>>>> effort (mostly from Paris) for making it mature enough [6]. Kate
>>>>>>>>>> mentioned
>>>>>>>>>> using GPUs, and I'm sure they have uses in streaming generally
>>>> [7].
>>>>>> Thus,
>>>>>>>>>> by improving the streaming API to allow ML algorithms, the
>>>> streaming
>>>>>> API
>>>>>>>>>> benefit too (which is important as they have a lot more production
>>>>>> users
>>>>>>>>>> than the batch API).
>>>>>>>>>>
>>>>>>>>>> 4) Performance can be at least as good
>>>>>>>>>> I believe the same performance could be achieved with the
>>>> streaming
>>>>>> API
>>>>>>>>>> as
>>>>>>>>>> with the batch API. Streaming API is much closer to the runtime
>>>> than
>>>>>> the
>>>>>>>>>> batch API. For corner-cases, with runtime-layer optimizations of
>>>>> batch
>>>>>>>>>> API,
>>>>>>>>>> we could find a way to do the same (or similar) optimization for
>>>> the
>>>>>>>>>> streaming API (see my previous point). Such case could be using
>>>>>> managed
>>>>>>>>>> memory (and spilling to disk). There are also benefits by default,
>>>>>> e.g.
>>>>>>>>>> we
>>>>>>>>>> would have a finer grained fault tolerance with the streaming API.
>>>>>>>>>>
>>>>>>>>>> 5) We could keep batch ML API
>>>>>>>>>> For the shorter term, we should not throw away all the algorithms
>>>>>>>>>> implemented with the batch API. By pushing forward the development
>>>>>> with
>>>>>>>>>> side inputs we could make them usable with streaming API. Then, if
>>>>> the
>>>>>>>>>> library gains some popularity, we could replace the algorithms in
>>>>> the
>>>>>>>>>> batch
>>>>>>>>>> API with streaming ones, to avoid the performance costs of e.g.
>>>> not
>>>>>> being
>>>>>>>>>> able to persist.
>>>>>>>>>>
>>>>>>>>>> 6) General tools for implementing ML algorithms
>>>>>>>>>> Besides implementing algorithms one by one, we could give more
>>>>> general
>>>>>>>>>> tools for making it easier to implement algorithms. E.g. parameter
>>>>>> server
>>>>>>>>>> [8,9]. Theo also mentioned in another thread that TensorFlow has a
>>>>>>>>>> similar
>>>>>>>>>> model to Flink streaming, we could look into that too. I think
>>>> often
>>>>>> when
>>>>>>>>>> deploying a production ML system, much more configuration and
>>>>> tweaking
>>>>>>>>>> should be done than e.g. Spark MLlib allows. Why not allow that?
>>>>>>>>>>
>>>>>>>>>> 7) Showcasing
>>>>>>>>>> Showcasing this could be easier. We could say that we're doing
>>>> batch
>>>>>> ML
>>>>>>>>>> with a streaming API. That's interesting in its own. IMHO this
>>>>>>>>>> integration
>>>>>>>>>> is also a more approachable way towards end-to-end ML.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks for reading so far :)
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
>>>>>>>>>> 13-final77.pdf
>>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
>>>>>>>>>> Scoped+Loops+and+Job+Termination
>>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.
>>>> pdf
>>>>>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
>>>>>>>>>> com/Using-QueryableState-inside-Flink-jobs-and-
>>>>>>>>>> Parameter-Server-implementation-td15880.html
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Gabor
>>>>>>>>>>
>>>>>>>>>>
>>>>>> --
>>>>> *Yours faithfully, *
>>>>>
>>>>> *Kate Eri.*
>>>>>
>


Re: [DISCUSS] Flink ML roadmap

Posted by Philipp Zehnder <ph...@gmx.de>.
Hello all,

I’m new to this mailing list and I wanted to introduce myself. My name is Philipp Zehnder and I’m a Master’s student in Computer Science at the Karlsruhe Institute of Technology in Germany, currently writing my master’s thesis, whose main goal is to integrate reusable machine learning components into a stream processing network. One part of my thesis is to create an API for distributed online machine learning.

I saw that there are some recent discussions about how to continue the development of Flink ML [1], and I want to share some of my experiences and maybe get some feedback from the community on my ideas.

As I am new to open source projects I hope this is the right place for this.

In the beginning, I had a look at various existing frameworks, for example Apache SAMOA, which is great and has a lot of useful resources. However, as Flink is currently focusing on streaming, from my point of view it makes sense to also have a streaming machine learning API as part of the Flink ecosystem.

I’m currently working on building a prototype for a distributed streaming machine learning library based on Flink that can be used for online and “classical” offline learning.

The machine learning algorithm takes labeled and unlabeled data. On a labeled data point, a prediction is performed first and then the label is used to train the model. On an unlabeled data point, only a prediction is performed. The main difference between the online and offline algorithms is that in the offline case the labeled data must be handed to the model before the unlabeled data, whereas in the online case it is still possible to process labeled data at a later point to update the model. The advantage of this approach is that batch algorithms can be applied to streaming data and online algorithms can be supported as well.
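
To make the predict-then-train idea concrete, here is a minimal sketch against the Flink Scala DataStream API (the DataPoint type and the tiny SGD-style linear model are invented just for illustration, not part of any existing API):

import org.apache.flink.streaming.api.scala._

// A data point that may or may not carry a label (hypothetical type, only for this sketch).
case class DataPoint(features: Array[Double], label: Option[Double])

object PredictThenTrainSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val points: DataStream[DataPoint] = env.fromElements(
      DataPoint(Array(1.0, 2.0), Some(3.5)),  // labeled: predict, then train
      DataPoint(Array(0.5, 1.0), None))       // unlabeled: predict only

    // All points go to a single key so they share one model instance;
    // a real library would need a distributed model (e.g. a parameter server).
    val predictions: DataStream[Double] = points
      .keyBy(_ => 0)
      .mapWithState[Double, Array[Double]] { (p, modelState) =>
        val weights = modelState.getOrElse(Array.fill(p.features.length)(0.0))
        val prediction = weights.zip(p.features).map { case (w, x) => w * x }.sum
        // On labeled points, take one SGD step after predicting ("test-then-train").
        val updated = p.label match {
          case Some(y) =>
            val err = prediction - y
            weights.zip(p.features).map { case (w, x) => w - 0.01 * err * x }
          case None => weights
        }
        (prediction, Some(updated))
      }

    predictions.print()
    env.execute("predict-then-train sketch")
  }
}

Keying everything to a constant is of course only a stand-in here; how to keep the model distributed is exactly the kind of design question the library would have to answer.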

One difference from batch learning is the transformers that are used to preprocess the data. For example, a simple mean subtraction must be implemented with a rolling mean, because we can’t calculate the mean over all the data; the Flink Streaming API is a good fit for that. It would be useful for users to have an extensible toolbox of such transformers.
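
As an illustration, such a rolling mean subtraction could look roughly like this with the Flink Scala DataStream API (Sample and the key are made up for the example):

import org.apache.flink.streaming.api.scala._

// A keyed measurement (hypothetical type, only for this sketch).
case class Sample(key: String, value: Double)

object RollingMeanSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val samples: DataStream[Sample] = env.fromElements(
      Sample("sensor-1", 1.0), Sample("sensor-1", 3.0), Sample("sensor-1", 5.0))

    // Maintain (count, sum) per key and emit the value minus the running mean,
    // i.e. an incremental version of mean subtraction.
    val meanSubtracted: DataStream[(String, Double)] = samples
      .keyBy(_.key)
      .mapWithState[(String, Double), (Long, Double)] { (s, state) =>
        val (count, sum) = state.getOrElse((0L, 0.0))
        val newCount = count + 1
        val newSum = sum + s.value
        ((s.key, s.value - newSum / newCount), Some((newCount, newSum)))
      }

    meanSubtracted.print()
    env.execute("rolling mean transformer sketch")
  }
}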

Another difference is the evaluation of the models. In streaming scenarios we don’t have a single value that determines the model quality; instead, the metric evolves over time as the model sees more labeled data.
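
For example, a running (prequential) error metric could be maintained like this (again only a sketch; PredictionAndLabel is an invented type):

import org.apache.flink.streaming.api.scala._

// A prediction paired with the true label, e.g. produced by the model on labeled points
// (hypothetical type, only for this sketch).
case class PredictionAndLabel(prediction: Double, label: Double)

object RunningErrorSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val scored: DataStream[PredictionAndLabel] = env.fromElements(
      PredictionAndLabel(2.5, 3.0), PredictionAndLabel(4.0, 3.5))

    // Running mean squared error: the quality metric evolves with every labeled point.
    val runningMse: DataStream[Double] = scored
      .keyBy(_ => 0)
      .mapWithState[Double, (Long, Double)] { (p, state) =>
        val (count, sumSqErr) = state.getOrElse((0L, 0.0))
        val err = p.prediction - p.label
        val newCount = count + 1
        val newSum = sumSqErr + err * err
        (newSum / newCount, Some((newCount, newSum)))
      }

    runningMse.print()
    env.execute("running error metric sketch")
  }
}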

However, transformation and evaluation again work similarly in both online learning and offline learning.

I also liked the discussion in [2], and I think the competition in the batch learning field is hard, with a lot of great projects already out there. It is true that in most real-world problems it is not necessary to update the model immediately, but there are a lot of use cases for machine learning on streams, and for them it would be nice to have a native approach.

A streaming machine learning API would fit Flink very well, and I would be willing to contribute to the future development of the Flink ML library.



Best regards,

Philipp

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-td16040.html <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-td16040.html>
[2] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2 <https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2>


> On 23.02.2017 at 15:48, Gábor Hermann <ma...@gaborhermann.com> wrote:
> 
> Okay, I've created a skeleton of the design doc for choosing a direction:
> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing
> 
> Much of the pros/cons have already been discussed here, so I'll try to put there all the arguments mentioned in this thread. Feel free to put there more :)
> 
> @Stavros: I agree we should take action fast. What about collecting our thoughts in the doc by around Tuesday next week (28. February)? Then decide on the direction and design a roadmap by around Friday (3. March)? Is that feasible, or should it take more time?
> 
> I think it will be necessary to have a shepherd, or even better a committer, to be involved in at least reviewing and accepting the roadmap. It would be best, if a committer coordinated all this.
> @Theodore: Would you like to do the coordination?
> 
> Regarding the use-cases: I've seen some abstracts of talks at SF Flink Forward [1] that seem promising. There are companies already using Flink for ML [2,3,4,5].
> 
> [1] http://sf.flink-forward.org/program/sessions/
> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-learning-on-flink/
> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learning-scenarios-with-flink/
> 
> Cheers,
> Gabor
> 
> 
> On 2017-02-23 15:19, Katherin Eri wrote:
>> I have asked already some teams for useful cases, but all of them need time
>> to think.
>> During analysis something will finally arise.
>> May be we can ask partners of Flink  for cases? Data Artisans got results
>> of customers survey: [1], ML better support is wanted, so we could ask what
>> exactly is necessary.
>> 
>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
>> 
>> On 23 Feb 2017 at 4:32 PM, "Stavros Kontopoulos" <
>> st.kontopoulos@gmail.com> wrote:
>> 
>>> +100 for a design doc.
>>> 
>>> Could we also set a roadmap after some time-boxed investigation captured in
>>> that document? We need action.
>>> 
>>> Looking forward to work on this (whatever that might be) ;) Also are there
>>> any data supporting one direction or the other from a customer perspective?
>>> It would help to make more informed decisions.
>>> 
>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <ka...@gmail.com>
>>> wrote:
>>> 
>>>> Yes, ok.
>>>> let's start some design document, and write down there already mentioned
>>>> ideas about: parameter server, about clipper and others. Would be nice if
>>>> we will also map this approaches to cases.
>>>> Will work on it collaboratively on each topic, may be finally we will
>>> form
>>>> some picture, that could be agreed with committers.
>>>> @Gabor, could you please start such shared doc, as you have already
>>> several
>>>> ideas proposed?
>>>> 
>>>> Thu, 23 Feb 2017, 15:06 Gábor Hermann <ma...@gaborhermann.com>:
>>>> 
>>>>> I agree, that it's better to go in one direction first, but I think
>>>>> online and offline with streaming API can go somewhat parallel later.
>>> We
>>>>> could set a short-term goal, concentrate initially on one direction,
>>> and
>>>>> showcase that direction (e.g. in a blogpost). But first, we should list
>>>>> the pros/cons in a design doc as a minimum. Then make a decision what
>>>>> direction to go. Would that be feasible?
>>>>> 
>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
>>>>> 
>>>>>> I'm not sure that this is feasible, doing all at the same time could
>>>> mean
>>>>>> doing nothing((((
>>>>>> I'm just afraid, that words: we will work on streaming not on
>>> batching,
>>>>> we
>>>>>> have no commiter's time for this, mean that yes, we started work on
>>>>>> FLINK-1730, but nobody will commit this work in the end, as it
>>> already
>>>>> was
>>>>>> with this ticket.
>>>>>> 
>>>>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <
>>>>> mail@gaborhermann.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> @Theodore: Great to hear you think the "batch on streaming" approach
>>>> is
>>>>>>> possible! Of course, we need to pay attention all the pitfalls
>>> there,
>>>>> if we
>>>>>>> go that way.
>>>>>>> 
>>>>>>> +1 for a design doc!
>>>>>>> 
>>>>>>> I would add that it's possible to make efforts in all the three
>>>>> directions
>>>>>>> (i.e. batch, online, batch on streaming) at the same time. Although,
>>>> it
>>>>>>> might be worth to concentrate on one. E.g. it would not be so useful
>>>> to
>>>>>>> have the same batch algorithms with both the batch API and streaming
>>>>> API.
>>>>>>> We can decide later.
>>>>>>> 
>>>>>>> The design doc could be partitioned to these 3 directions, and we
>>> can
>>>>>>> collect there the pros/cons too. What do you think?
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Gabor
>>>>>>> 
>>>>>>> 
>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>>>> 
>>>>>>>> Hello all,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> @Gabor, we have discussed the idea of using the streaming API to
>>>> write
>>>>> all
>>>>>>>> of our ML algorithms with a couple of people offline,
>>>>>>>> and I think it might be possible and is generally worth a shot. The
>>>>>>>> approach we would take would be close to Vowpal Wabbit, not exactly
>>>>>>>> "online", but rather "fast-batch".
>>>>>>>> 
>>>>>>>> There will be problems popping up again, even for very simple algos
>>>>> like
>>>>>>>> on
>>>>>>>> line linear regression with SGD [1], but hopefully fixing those
>>> will
>>>> be
>>>>>>>> more aligned with the priorities of the community.
>>>>>>>> 
>>>>>>>> @Katherin, my understanding is that given the limited resources,
>>>> there
>>>>> is
>>>>>>>> no development effort focused on batch processing right now.
>>>>>>>> 
>>>>>>>> So to summarize, it seems like there are people willing to work on
>>> ML
>>>>> on
>>>>>>>> Flink, but nobody is sure how to do it.
>>>>>>>> There are many directions we could take (batch, online, batch on
>>>>>>>> streaming), each with its own merits and downsides.
>>>>>>>> 
>>>>>>>> If you want we can start a design doc and move the conversation
>>>> there,
>>>>>>>> come
>>>>>>>> up with a roadmap and start implementing.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Theodore
>>>>>>>> 
>>>>>>>> [1]
>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.
>>>>>>>> nabble.com/Understanding-connected-streams-use-without-times
>>>>>>>> tamps-td10241.html
>>>>>>>> 
>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
>>>> mail@gaborhermann.com
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> It's great to see so much activity in this discussion :)
>>>>>>>>> I'll try to add my thoughts.
>>>>>>>>> 
>>>>>>>>> I think building a developer community (Till's 2. point) can be
>>>>> slightly
>>>>>>>>> separated from what features we should aim for (1. point) and
>>>>> showcasing
>>>>>>>>> (3. point). Thanks Till for bringing up the ideas for
>>> restructuring,
>>>>> I'm
>>>>>>>>> sure we'll find a way to make the development process more
>>> dynamic.
>>>>> I'll
>>>>>>>>> try to address the rest here.
>>>>>>>>> 
>>>>>>>>> It's hard to choose directions between streaming and batch ML. As
>>>> Theo
>>>>>>>>> has
>>>>>>>>> indicated, not much online ML is used in production, but Flink
>>>>>>>>> concentrates
>>>>>>>>> on streaming, so online ML would be a better fit for Flink.
>>> However,
>>>>> as
>>>>>>>>> most of you argued, there's definite need for batch ML. But batch
>>> ML
>>>>>>>>> seems
>>>>>>>>> hard to achieve because there are blocking issues with persisting,
>>>>>>>>> iteration paths etc. So it's no good either way.
>>>>>>>>> 
>>>>>>>>> I propose a seemingly crazy solution: what if we developed batch
>>>>>>>>> algorithms also with the streaming API? The batch API would
>>> clearly
>>>>> seem
>>>>>>>>> more suitable for ML algorithms, but there a lot of benefits of
>>> this
>>>>>>>>> approach too, so it's clearly worth considering. Flink also has
>>> the
>>>>> high
>>>>>>>>> level vision of "streaming for everything" that would clearly fit
>>>> this
>>>>>>>>> case. What do you all think about this? Do you think this solution
>>>>> would
>>>>>>>>> be
>>>>>>>>> feasible? I would be happy to make a more elaborate proposal, but
>>> I
>>>>> push
>>>>>>>>> my
>>>>>>>>> main ideas here:
>>>>>>>>> 
>>>>>>>>> 1) Simplifying by using one system
>>>>>>>>> It could simplify the work of both the users and the developers.
>>> One
>>>>>>>>> could
>>>>>>>>> execute training once, or could execute it periodically e.g. by
>>>> using
>>>>>>>>> windows. Low-latency serving and training could be done in the
>>> same
>>>>>>>>> system.
>>>>>>>>> We could implement incremental algorithms, without any side inputs
>>>> for
>>>>>>>>> combining online learning (or predictions) with batch learning. Of
>>>>>>>>> course,
>>>>>>>>> all the logic describing these must be somehow implemented (e.g.
>>>>>>>>> synchronizing predictions with training), but it should be easier
>>> to
>>>>> do
>>>>>>>>> so
>>>>>>>>> in one system, than by combining e.g. the batch and streaming API.
>>>>>>>>> 
>>>>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>>>>> Despite these benefits, it could seem harder to implement batch ML
>>>>> with
>>>>>>>>> the streaming API, but in my opinion it's not. There are more
>>>>> flexible,
>>>>>>>>> lower-level optimization potentials with the streaming API. Most
>>>>>>>>> distributed ML algorithms use a lower-level model than the batch
>>> API
>>>>>>>>> anyway, so sometimes it feels like forcing the algorithm logic
>>> into
>>>>> the
>>>>>>>>> training API and tweaking it. Although we could not use the batch
>>>>>>>>> primitives like join, we would have the E.g. in my experience with
>>>>>>>>> implementing a distributed matrix factorization algorithm [1], I
>>>>> couldn't
>>>>>>>>> do a simple optimization because of the limitations of the
>>> iteration
>>>>> API
>>>>>>>>> [2]. Even if we pushed all the development effort to make the
>>> batch
>>>>> API
>>>>>>>>> more suitable for ML there would be things we couldn't do. E.g.
>>>> there
>>>>> are
>>>>>>>>> approaches for updating a model iteratively without locks [3,4]
>>>> (i.e.
>>>>>>>>> somewhat asynchronously), and I don't see a clear way to implement
>>>>> such
>>>>>>>>> algorithms with the batch API.
>>>>>>>>> 
>>>>>>>>> 3) Streaming community (users and devs) benefit
>>>>>>>>> The Flink streaming community in general would also benefit from
>>>> this
>>>>>>>>> direction. There are many features needed in the streaming API for
>>>> ML
>>>>> to
>>>>>>>>> work, but this is also true for the batch API. One really
>>> important
>>>> is
>>>>>>>>> the
>>>>>>>>> loops API (a.k.a. iterative DataStreams) [5]. There has been a lot
>>>> of
>>>>>>>>> effort (mostly from Paris) for making it mature enough [6]. Kate
>>>>>>>>> mentioned
>>>>>>>>> using GPUs, and I'm sure they have uses in streaming generally
>>> [7].
>>>>> Thus,
>>>>>>>>> by improving the streaming API to allow ML algorithms, the
>>> streaming
>>>>> API
>>>>>>>>> benefit too (which is important as they have a lot more production
>>>>> users
>>>>>>>>> than the batch API).
>>>>>>>>> 
>>>>>>>>> 4) Performance can be at least as good
>>>>>>>>> I believe the same performance could be achieved with the
>>> streaming
>>>>> API
>>>>>>>>> as
>>>>>>>>> with the batch API. Streaming API is much closer to the runtime
>>> than
>>>>> the
>>>>>>>>> batch API. For corner-cases, with runtime-layer optimizations of
>>>> batch
>>>>>>>>> API,
>>>>>>>>> we could find a way to do the same (or similar) optimization for
>>> the
>>>>>>>>> streaming API (see my previous point). Such case could be using
>>>>> managed
>>>>>>>>> memory (and spilling to disk). There are also benefits by default,
>>>>> e.g.
>>>>>>>>> we
>>>>>>>>> would have a finer grained fault tolerance with the streaming API.
>>>>>>>>> 
>>>>>>>>> 5) We could keep batch ML API
>>>>>>>>> For the shorter term, we should not throw away all the algorithms
>>>>>>>>> implemented with the batch API. By pushing forward the development
>>>>> with
>>>>>>>>> side inputs we could make them usable with streaming API. Then, if
>>>> the
>>>>>>>>> library gains some popularity, we could replace the algorithms in
>>>> the
>>>>>>>>> batch
>>>>>>>>> API with streaming ones, to avoid the performance costs of e.g.
>>> not
>>>>> being
>>>>>>>>> able to persist.
>>>>>>>>> 
>>>>>>>>> 6) General tools for implementing ML algorithms
>>>>>>>>> Besides implementing algorithms one by one, we could give more
>>>> general
>>>>>>>>> tools for making it easier to implement algorithms. E.g. parameter
>>>>> server
>>>>>>>>> [8,9]. Theo also mentioned in another thread that TensorFlow has a
>>>>>>>>> similar
>>>>>>>>> model to Flink streaming, we could look into that too. I think
>>> often
>>>>> when
>>>>>>>>> deploying a production ML system, much more configuration and
>>>> tweaking
>>>>>>>>> should be done than e.g. Spark MLlib allows. Why not allow that?
>>>>>>>>> 
>>>>>>>>> 7) Showcasing
>>>>>>>>> Showcasing this could be easier. We could say that we're doing
>>> batch
>>>>> ML
>>>>>>>>> with a streaming API. That's interesting in its own. IMHO this
>>>>>>>>> integration
>>>>>>>>> is also a more approachable way towards end-to-end ML.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks for reading so far :)
>>>>>>>>> 
>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
>>>>>>>>> 13-final77.pdf
>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
>>>>>>>>> Scoped+Loops+and+Job+Termination
>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.
>>> pdf
>>>>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
>>>>>>>>> com/Using-QueryableState-inside-Flink-jobs-and-
>>>>>>>>> Parameter-Server-implementation-td15880.html
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Gabor
>>>>>>>>> 
>>>>>>>>> 
>>>>> --
>>>> *Yours faithfully, *
>>>> 
>>>> *Kate Eri.*
>>>> 
> 


Re: [DISCUSS] Flink ML roadmap

Posted by Gábor Hermann <ma...@gaborhermann.com>.
Okay, I've created a skeleton of the design doc for choosing a direction:
https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing

Many of the pros and cons have already been discussed here, so I'll try to 
put all the arguments mentioned in this thread there. Feel free to add 
more :)

@Stavros: I agree we should take action fast. What about collecting our 
thoughts in the doc by around Tuesday next week (28. February)? Then 
decide on the direction and design a roadmap by around Friday (3. 
March)? Is that feasible, or should it take more time?

I think it will be necessary to have a shepherd, or even better a 
committer, involved in at least reviewing and accepting the 
roadmap. It would be best if a committer coordinated all this.
@Theodore: Would you like to do the coordination?

Regarding the use-cases: I've seen some abstracts of talks at SF Flink 
Forward [1] that seem promising. There are companies already using Flink 
for ML [2,3,4,5].

[1] http://sf.flink-forward.org/program/sessions/
[2] 
http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
[3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
[4] 
http://sf.flink-forward.org/kb_sessions/non-flink-machine-learning-on-flink/
[5] 
http://sf.flink-forward.org/kb_sessions/streaming-deep-learning-scenarios-with-flink/

Cheers,
Gabor


On 2017-02-23 15:19, Katherin Eri wrote:
> I have asked already some teams for useful cases, but all of them need time
> to think.
> During analysis something will finally arise.
> May be we can ask partners of Flink  for cases? Data Artisans got results
> of customers survey: [1], ML better support is wanted, so we could ask what
> exactly is necessary.
>
> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
>
> On 23 Feb 2017 at 4:32 PM, "Stavros Kontopoulos" <
> st.kontopoulos@gmail.com> wrote:
>
>> +100 for a design doc.
>>
>> Could we also set a roadmap after some time-boxed investigation captured in
>> that document? We need action.
>>
>> Looking forward to work on this (whatever that might be) ;) Also are there
>> any data supporting one direction or the other from a customer perspective?
>> It would help to make more informed decisions.
>>
>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <ka...@gmail.com>
>> wrote:
>>
>>> Yes, ok.
>>> let's start some design document, and write down there already mentioned
>>> ideas about: parameter server, about clipper and others. Would be nice if
>>> we will also map this approaches to cases.
>>> Will work on it collaboratively on each topic, may be finally we will
>> form
>>> some picture, that could be agreed with committers.
>>> @Gabor, could you please start such shared doc, as you have already
>> several
>>> ideas proposed?
>>>
>>> Thu, 23 Feb 2017, 15:06 Gábor Hermann <ma...@gaborhermann.com>:
>>>
>>>> I agree, that it's better to go in one direction first, but I think
>>>> online and offline with streaming API can go somewhat parallel later.
>> We
>>>> could set a short-term goal, concentrate initially on one direction,
>> and
>>>> showcase that direction (e.g. in a blogpost). But first, we should list
>>>> the pros/cons in a design doc as a minimum. Then make a decision what
>>>> direction to go. Would that be feasible?
>>>>
>>>> On 2017-02-23 12:34, Katherin Eri wrote:
>>>>
>>>>> I'm not sure that this is feasible, doing all at the same time could
>>> mean
>>>>> doing nothing((((
>>>>> I'm just afraid, that words: we will work on streaming not on
>> batching,
>>>> we
>>>>> have no commiter's time for this, mean that yes, we started work on
>>>>> FLINK-1730, but nobody will commit this work in the end, as it
>> already
>>>> was
>>>>> with this ticket.
>>>>>
>>>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <
>>>> mail@gaborhermann.com>
>>>>> wrote:
>>>>>
>>>>>> @Theodore: Great to hear you think the "batch on streaming" approach
>>> is
>>>>>> possible! Of course, we need to pay attention all the pitfalls
>> there,
>>>> if we
>>>>>> go that way.
>>>>>>
>>>>>> +1 for a design doc!
>>>>>>
>>>>>> I would add that it's possible to make efforts in all the three
>>>> directions
>>>>>> (i.e. batch, online, batch on streaming) at the same time. Although,
>>> it
>>>>>> might be worth to concentrate on one. E.g. it would not be so useful
>>> to
>>>>>> have the same batch algorithms with both the batch API and streaming
>>>> API.
>>>>>> We can decide later.
>>>>>>
>>>>>> The design doc could be partitioned to these 3 directions, and we
>> can
>>>>>> collect there the pros/cons too. What do you think?
>>>>>>
>>>>>> Cheers,
>>>>>> Gabor
>>>>>>
>>>>>>
>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>>
>>>>>>> @Gabor, we have discussed the idea of using the streaming API to
>>> write
>>>> all
>>>>>>> of our ML algorithms with a couple of people offline,
>>>>>>> and I think it might be possible and is generally worth a shot. The
>>>>>>> approach we would take would be close to Vowpal Wabbit, not exactly
>>>>>>> "online", but rather "fast-batch".
>>>>>>>
>>>>>>> There will be problems popping up again, even for very simple algos
>>>> like
>>>>>>> on
>>>>>>> line linear regression with SGD [1], but hopefully fixing those
>> will
>>> be
>>>>>>> more aligned with the priorities of the community.
>>>>>>>
>>>>>>> @Katherin, my understanding is that given the limited resources,
>>> there
>>>> is
>>>>>>> no development effort focused on batch processing right now.
>>>>>>>
>>>>>>> So to summarize, it seems like there are people willing to work on
>> ML
>>>> on
>>>>>>> Flink, but nobody is sure how to do it.
>>>>>>> There are many directions we could take (batch, online, batch on
>>>>>>> streaming), each with its own merits and downsides.
>>>>>>>
>>>>>>> If you want we can start a design doc and move the conversation
>>> there,
>>>>>>> come
>>>>>>> up with a roadmap and start implementing.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Theodore
>>>>>>>
>>>>>>> [1]
>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.
>>>>>>> nabble.com/Understanding-connected-streams-use-without-times
>>>>>>> tamps-td10241.html
>>>>>>>
>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
>>> mail@gaborhermann.com
>>>>>>> wrote:
>>>>>>>
>>>>>>> It's great to see so much activity in this discussion :)
>>>>>>>> I'll try to add my thoughts.
>>>>>>>>
>>>>>>>> I think building a developer community (Till's 2. point) can be
>>>> slightly
>>>>>>>> separated from what features we should aim for (1. point) and
>>>> showcasing
>>>>>>>> (3. point). Thanks Till for bringing up the ideas for
>> restructuring,
>>>> I'm
>>>>>>>> sure we'll find a way to make the development process more
>> dynamic.
>>>> I'll
>>>>>>>> try to address the rest here.
>>>>>>>>
>>>>>>>> It's hard to choose directions between streaming and batch ML. As
>>> Theo
>>>>>>>> has
>>>>>>>> indicated, not much online ML is used in production, but Flink
>>>>>>>> concentrates
>>>>>>>> on streaming, so online ML would be a better fit for Flink.
>> However,
>>>> as
>>>>>>>> most of you argued, there's definite need for batch ML. But batch
>> ML
>>>>>>>> seems
>>>>>>>> hard to achieve because there are blocking issues with persisting,
>>>>>>>> iteration paths etc. So it's no good either way.
>>>>>>>>
>>>>>>>> I propose a seemingly crazy solution: what if we developed batch
>>>>>>>> algorithms also with the streaming API? The batch API would
>> clearly
>>>> seem
>>>>>>>> more suitable for ML algorithms, but there a lot of benefits of
>> this
>>>>>>>> approach too, so it's clearly worth considering. Flink also has
>> the
>>>> high
>>>>>>>> level vision of "streaming for everything" that would clearly fit
>>> this
>>>>>>>> case. What do you all think about this? Do you think this solution
>>>> would
>>>>>>>> be
>>>>>>>> feasible? I would be happy to make a more elaborate proposal, but
>> I
>>>> push
>>>>>>>> my
>>>>>>>> main ideas here:
>>>>>>>>
>>>>>>>> 1) Simplifying by using one system
>>>>>>>> It could simplify the work of both the users and the developers.
>> One
>>>>>>>> could
>>>>>>>> execute training once, or could execute it periodically e.g. by
>>> using
>>>>>>>> windows. Low-latency serving and training could be done in the
>> same
>>>>>>>> system.
>>>>>>>> We could implement incremental algorithms, without any side inputs
>>> for
>>>>>>>> combining online learning (or predictions) with batch learning. Of
>>>>>>>> course,
>>>>>>>> all the logic describing these must be somehow implemented (e.g.
>>>>>>>> synchronizing predictions with training), but it should be easier
>> to
>>>> do
>>>>>>>> so
>>>>>>>> in one system, than by combining e.g. the batch and streaming API.
>>>>>>>>
>>>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>>>> Despite these benefits, it could seem harder to implement batch ML
>>>> with
>>>>>>>> the streaming API, but in my opinion it's not. There are more
>>>> flexible,
>>>>>>>> lower-level optimization potentials with the streaming API. Most
>>>>>>>> distributed ML algorithms use a lower-level model than the batch
>> API
>>>>>>>> anyway, so sometimes it feels like forcing the algorithm logic
>> into
>>>> the
>>>>>>>> training API and tweaking it. Although we could not use the batch
>>>>>>>> primitives like join, we would have the E.g. in my experience with
>>>>>>>> implementing a distributed matrix factorization algorithm [1], I
>>>> couldn't
>>>>>>>> do a simple optimization because of the limitations of the
>> iteration
>>>> API
>>>>>>>> [2]. Even if we pushed all the development effort to make the
>> batch
>>>> API
>>>>>>>> more suitable for ML there would be things we couldn't do. E.g.
>>> there
>>>> are
>>>>>>>> approaches for updating a model iteratively without locks [3,4]
>>> (i.e.
>>>>>>>> somewhat asynchronously), and I don't see a clear way to implement
>>>> such
>>>>>>>> algorithms with the batch API.
>>>>>>>>
>>>>>>>> 3) Streaming community (users and devs) benefit
>>>>>>>> The Flink streaming community in general would also benefit from
>>> this
>>>>>>>> direction. There are many features needed in the streaming API for
>>> ML
>>>> to
>>>>>>>> work, but this is also true for the batch API. One really
>> important
>>> is
>>>>>>>> the
>>>>>>>> loops API (a.k.a. iterative DataStreams) [5]. There has been a lot
>>> of
>>>>>>>> effort (mostly from Paris) for making it mature enough [6]. Kate
>>>>>>>> mentioned
>>>>>>>> using GPUs, and I'm sure they have uses in streaming generally
>> [7].
>>>> Thus,
>>>>>>>> by improving the streaming API to allow ML algorithms, the
>> streaming
>>>> API
>>>>>>>> benefit too (which is important as they have a lot more production
>>>> users
>>>>>>>> than the batch API).
>>>>>>>>
>>>>>>>> 4) Performance can be at least as good
>>>>>>>> I believe the same performance could be achieved with the
>> streaming
>>>> API
>>>>>>>> as
>>>>>>>> with the batch API. Streaming API is much closer to the runtime
>> than
>>>> the
>>>>>>>> batch API. For corner-cases, with runtime-layer optimizations of
>>> batch
>>>>>>>> API,
>>>>>>>> we could find a way to do the same (or similar) optimization for
>> the
>>>>>>>> streaming API (see my previous point). Such case could be using
>>>> managed
>>>>>>>> memory (and spilling to disk). There are also benefits by default,
>>>> e.g.
>>>>>>>> we
>>>>>>>> would have a finer grained fault tolerance with the streaming API.
>>>>>>>>
>>>>>>>> 5) We could keep batch ML API
>>>>>>>> For the shorter term, we should not throw away all the algorithms
>>>>>>>> implemented with the batch API. By pushing forward the development
>>>> with
>>>>>>>> side inputs we could make them usable with streaming API. Then, if
>>> the
>>>>>>>> library gains some popularity, we could replace the algorithms in
>>> the
>>>>>>>> batch
>>>>>>>> API with streaming ones, to avoid the performance costs of e.g.
>> not
>>>> being
>>>>>>>> able to persist.
>>>>>>>>
>>>>>>>> 6) General tools for implementing ML algorithms
>>>>>>>> Besides implementing algorithms one by one, we could give more
>>> general
>>>>>>>> tools for making it easier to implement algorithms. E.g. parameter
>>>> server
>>>>>>>> [8,9]. Theo also mentioned in another thread that TensorFlow has a
>>>>>>>> similar
>>>>>>>> model to Flink streaming, we could look into that too. I think
>> often
>>>> when
>>>>>>>> deploying a production ML system, much more configuration and
>>> tweaking
>>>>>>>> should be done than e.g. Spark MLlib allows. Why not allow that?
>>>>>>>>
>>>>>>>> 7) Showcasing
>>>>>>>> Showcasing this could be easier. We could say that we're doing
>> batch
>>>> ML
>>>>>>>> with a streaming API. That's interesting in its own. IMHO this
>>>>>>>> integration
>>>>>>>> is also a more approachable way towards end-to-end ML.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for reading so far :)
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
>>>>>>>> 13-final77.pdf
>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
>>>>>>>> Scoped+Loops+and+Job+Termination
>>>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.
>> pdf
>>>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
>>>>>>>> com/Using-QueryableState-inside-Flink-jobs-and-
>>>>>>>> Parameter-Server-implementation-td15880.html
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Gabor
>>>>>>>>
>>>>>>>>
>>>> --
>>> *Yours faithfully, *
>>>
>>> *Kate Eri.*
>>>
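
To make the "fast-batch" training plus low-latency serving idea from the
quoted proposal a bit more concrete, here is a minimal sketch against the
DataStream Scala API. It is only an illustration of the direction, not an
existing Flink or FlinkML API: LabeledPoint, Model and the toy SGD pass are
placeholder names, the sources are stand-ins, and a real job would need care
around checkpointing and window semantics.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class LabeledPoint(x: Array[Double], y: Double)
case class Model(w: Array[Double]) {
  def predict(x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum
}

object FastBatchSketch {
  // One SGD pass over the buffered window (toy linear regression).
  def train(points: Iterable[LabeledPoint], dim: Int, lr: Double = 0.01): Model = {
    val w = Array.fill(dim)(0.0)
    for (p <- points) {
      val err = p.y - w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
      for (i <- w.indices) w(i) += lr * err * p.x(i)
    }
    Model(w)
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val dim = 2

    // Stand-in sources; in practice these would be Kafka topics etc.
    val labeled: DataStream[LabeledPoint] =
      env.fromCollection(Seq(LabeledPoint(Array(1.0, 2.0), 3.0)))
    val queries: DataStream[Array[Double]] =
      env.fromCollection(Seq(Array(1.0, 1.0)))

    // Periodic ("fast-batch") retraining: one model per 10-minute window.
    val models: DataStream[Model] = labeled
      .timeWindowAll(Time.minutes(10))
      .apply { (_: TimeWindow, pts: Iterable[LabeledPoint], out: Collector[Model]) =>
        out.collect(train(pts, dim))
      }

    // Serve predictions with the most recent model, broadcast to all tasks.
    val predictions: DataStream[Double] = queries
      .connect(models.broadcast)
      .flatMap(new CoFlatMapFunction[Array[Double], Model, Double] {
        private var current = Model(Array.fill(dim)(0.0))
        override def flatMap1(x: Array[Double], out: Collector[Double]): Unit =
          out.collect(current.predict(x))
        override def flatMap2(m: Model, out: Collector[Double]): Unit =
          current = m
      })

    predictions.print()
    env.execute("fast-batch sketch")
  }
}

The window length here plays the role of the batch size, which is roughly
what the Vowpal-Wabbit-style "fast-batch" approach mentioned above amounts to.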


Re: [DISCUSS] Flink ML roadmap

Posted by Katherin Eri <ka...@gmail.com>.
I have already asked some teams for useful cases, but all of them need time
to think. During the analysis something will surely come up.
Maybe we can also ask Flink's partners for cases? Data Artisans got results
from the customer survey [1]: better ML support is wanted, so we could ask
what exactly is needed.

[1] http://data-artisans.com/flink-user-survey-2016-part-2/

On Feb 23, 2017 at 4:32 PM, "Stavros Kontopoulos" <
st.kontopoulos@gmail.com> wrote:

> +100 for a design doc.
>
> Could we also set a roadmap after some time-boxed investigation captured in
> that document? We need action.
>
> Looking forward to work on this (whatever that might be) ;) Also are there
> any data supporting one direction or the other from a customer perspective?
> It would help to make more informed decisions.
>
> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <ka...@gmail.com>
> wrote:
>
> > Yes, ok.
> > let's start some design document, and write down there already mentioned
> > ideas about: parameter server, about clipper and others. Would be nice if
> > we will also map this approaches to cases.
> > Will work on it collaboratively on each topic, may be finally we will
> form
> > some picture, that could be agreed with committers.
> > @Gabor, could you please start such shared doc, as you have already
> several
> > ideas proposed?
> >
> > On Thu, Feb 23, 2017 at 15:06, Gábor Hermann <ma...@gaborhermann.com> wrote:
> >
> > > I agree, that it's better to go in one direction first, but I think
> > > online and offline with streaming API can go somewhat parallel later.
> We
> > > could set a short-term goal, concentrate initially on one direction,
> and
> > > showcase that direction (e.g. in a blogpost). But first, we should list
> > > the pros/cons in a design doc as a minimum. Then make a decision what
> > > direction to go. Would that be feasible?
> > >
> > > On 2017-02-23 12:34, Katherin Eri wrote:
> > >
> > > > I'm not sure that this is feasible, doing all at the same time could
> > mean
> > > > doing nothing((((
> > > > I'm just afraid, that words: we will work on streaming not on
> batching,
> > > we
> > > > have no commiter's time for this, mean that yes, we started work on
> > > > FLINK-1730, but nobody will commit this work in the end, as it
> already
> > > was
> > > > with this ticket.
> > > >
> > > > On Feb 23, 2017 at 14:26, "Gábor Hermann" <
> > > mail@gaborhermann.com>
> > > > wrote:
> > > >
> > > >> @Theodore: Great to hear you think the "batch on streaming" approach
> > is
> > > >> possible! Of course, we need to pay attention all the pitfalls
> there,
> > > if we
> > > >> go that way.
> > > >>
> > > >> +1 for a design doc!
> > > >>
> > > >> I would add that it's possible to make efforts in all the three
> > > directions
> > > >> (i.e. batch, online, batch on streaming) at the same time. Although,
> > it
> > > >> might be worth to concentrate on one. E.g. it would not be so useful
> > to
> > > >> have the same batch algorithms with both the batch API and streaming
> > > API.
> > > >> We can decide later.
> > > >>
> > > >> The design doc could be partitioned to these 3 directions, and we
> can
> > > >> collect there the pros/cons too. What do you think?
> > > >>
> > > >> Cheers,
> > > >> Gabor
> > > >>
> > > >>
> > > >> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
> > > >>
> > > >>> Hello all,
> > > >>>
> > > >>>
> > > >>> @Gabor, we have discussed the idea of using the streaming API to
> > write
> > > all
> > > >>> of our ML algorithms with a couple of people offline,
> > > >>> and I think it might be possible and is generally worth a shot. The
> > > >>> approach we would take would be close to Vowpal Wabbit, not exactly
> > > >>> "online", but rather "fast-batch".
> > > >>>
> > > >>> There will be problems popping up again, even for very simple algos
> > > like
> > > >>> on
> > > >>> line linear regression with SGD [1], but hopefully fixing those
> will
> > be
> > > >>> more aligned with the priorities of the community.
> > > >>>
> > > >>> @Katherin, my understanding is that given the limited resources,
> > there
> > > is
> > > >>> no development effort focused on batch processing right now.
> > > >>>
> > > >>> So to summarize, it seems like there are people willing to work on
> ML
> > > on
> > > >>> Flink, but nobody is sure how to do it.
> > > >>> There are many directions we could take (batch, online, batch on
> > > >>> streaming), each with its own merits and downsides.
> > > >>>
> > > >>> If you want we can start a design doc and move the conversation
> > there,
> > > >>> come
> > > >>> up with a roadmap and start implementing.
> > > >>>
> > > >>> Regards,
> > > >>> Theodore
> > > >>>
> > > >>> [1]
> > > >>> http://apache-flink-user-mailing-list-archive.2336050.n4.
> > > >>> nabble.com/Understanding-connected-streams-use-without-times
> > > >>> tamps-td10241.html
> > > >>>
> > > >>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
> > mail@gaborhermann.com
> > > >
> > > >>> wrote:
> > > >>>
> > > >>> It's great to see so much activity in this discussion :)
> > > >>>> I'll try to add my thoughts.
> > > >>>>
> > > >>>> I think building a developer community (Till's 2. point) can be
> > > slightly
> > > >>>> separated from what features we should aim for (1. point) and
> > > showcasing
> > > >>>> (3. point). Thanks Till for bringing up the ideas for
> restructuring,
> > > I'm
> > > >>>> sure we'll find a way to make the development process more
> dynamic.
> > > I'll
> > > >>>> try to address the rest here.
> > > >>>>
> > > >>>> It's hard to choose directions between streaming and batch ML. As
> > Theo
> > > >>>> has
> > > >>>> indicated, not much online ML is used in production, but Flink
> > > >>>> concentrates
> > > >>>> on streaming, so online ML would be a better fit for Flink.
> However,
> > > as
> > > >>>> most of you argued, there's definite need for batch ML. But batch
> ML
> > > >>>> seems
> > > >>>> hard to achieve because there are blocking issues with persisting,
> > > >>>> iteration paths etc. So it's no good either way.
> > > >>>>
> > > >>>> I propose a seemingly crazy solution: what if we developed batch
> > > >>>> algorithms also with the streaming API? The batch API would
> clearly
> > > seem
> > > >>>> more suitable for ML algorithms, but there a lot of benefits of
> this
> > > >>>> approach too, so it's clearly worth considering. Flink also has
> the
> > > high
> > > >>>> level vision of "streaming for everything" that would clearly fit
> > this
> > > >>>> case. What do you all think about this? Do you think this solution
> > > would
> > > >>>> be
> > > >>>> feasible? I would be happy to make a more elaborate proposal, but
> I
> > > push
> > > >>>> my
> > > >>>> main ideas here:
> > > >>>>
> > > >>>> 1) Simplifying by using one system
> > > >>>> It could simplify the work of both the users and the developers.
> One
> > > >>>> could
> > > >>>> execute training once, or could execute it periodically e.g. by
> > using
> > > >>>> windows. Low-latency serving and training could be done in the
> same
> > > >>>> system.
> > > >>>> We could implement incremental algorithms, without any side inputs
> > for
> > > >>>> combining online learning (or predictions) with batch learning. Of
> > > >>>> course,
> > > >>>> all the logic describing these must be somehow implemented (e.g.
> > > >>>> synchronizing predictions with training), but it should be easier
> to
> > > do
> > > >>>> so
> > > >>>> in one system, than by combining e.g. the batch and streaming API.
> > > >>>>
> > > >>>> 2) Batch ML with the streaming API is not harder
> > > >>>> Despite these benefits, it could seem harder to implement batch ML
> > > with
> > > >>>> the streaming API, but in my opinion it's not. There are more
> > > flexible,
> > > >>>> lower-level optimization potentials with the streaming API. Most
> > > >>>> distributed ML algorithms use a lower-level model than the batch
> API
> > > >>>> anyway, so sometimes it feels like forcing the algorithm logic
> into
> > > the
> > > >>>> training API and tweaking it. Although we could not use the batch
> > > >>>> primitives like join, we would have the E.g. in my experience with
> > > >>>> implementing a distributed matrix factorization algorithm [1], I
> > > couldn't
> > > >>>> do a simple optimization because of the limitations of the
> iteration
> > > API
> > > >>>> [2]. Even if we pushed all the development effort to make the
> batch
> > > API
> > > >>>> more suitable for ML there would be things we couldn't do. E.g.
> > there
> > > are
> > > >>>> approaches for updating a model iteratively without locks [3,4]
> > (i.e.
> > > >>>> somewhat asynchronously), and I don't see a clear way to implement
> > > such
> > > >>>> algorithms with the batch API.
> > > >>>>
> > > >>>> 3) Streaming community (users and devs) benefit
> > > >>>> The Flink streaming community in general would also benefit from
> > this
> > > >>>> direction. There are many features needed in the streaming API for
> > ML
> > > to
> > > >>>> work, but this is also true for the batch API. One really
> important
> > is
> > > >>>> the
> > > >>>> loops API (a.k.a. iterative DataStreams) [5]. There has been a lot
> > of
> > > >>>> effort (mostly from Paris) for making it mature enough [6]. Kate
> > > >>>> mentioned
> > > >>>> using GPUs, and I'm sure they have uses in streaming generally
> [7].
> > > Thus,
> > > >>>> by improving the streaming API to allow ML algorithms, the
> streaming
> > > API
> > > >>>> benefit too (which is important as they have a lot more production
> > > users
> > > >>>> than the batch API).
> > > >>>>
> > > >>>> 4) Performance can be at least as good
> > > >>>> I believe the same performance could be achieved with the
> streaming
> > > API
> > > >>>> as
> > > >>>> with the batch API. Streaming API is much closer to the runtime
> than
> > > the
> > > >>>> batch API. For corner-cases, with runtime-layer optimizations of
> > batch
> > > >>>> API,
> > > >>>> we could find a way to do the same (or similar) optimization for
> the
> > > >>>> streaming API (see my previous point). Such case could be using
> > > managed
> > > >>>> memory (and spilling to disk). There are also benefits by default,
> > > e.g.
> > > >>>> we
> > > >>>> would have a finer grained fault tolerance with the streaming API.
> > > >>>>
> > > >>>> 5) We could keep batch ML API
> > > >>>> For the shorter term, we should not throw away all the algorithms
> > > >>>> implemented with the batch API. By pushing forward the development
> > > with
> > > >>>> side inputs we could make them usable with streaming API. Then, if
> > the
> > > >>>> library gains some popularity, we could replace the algorithms in
> > the
> > > >>>> batch
> > > >>>> API with streaming ones, to avoid the performance costs of e.g.
> not
> > > being
> > > >>>> able to persist.
> > > >>>>
> > > >>>> 6) General tools for implementing ML algorithms
> > > >>>> Besides implementing algorithms one by one, we could give more
> > general
> > > >>>> tools for making it easier to implement algorithms. E.g. parameter
> > > server
> > > >>>> [8,9]. Theo also mentioned in another thread that TensorFlow has a
> > > >>>> similar
> > > >>>> model to Flink streaming, we could look into that too. I think
> often
> > > when
> > > >>>> deploying a production ML system, much more configuration and
> > tweaking
> > > >>>> should be done than e.g. Spark MLlib allows. Why not allow that?
> > > >>>>
> > > >>>> 7) Showcasing
> > > >>>> Showcasing this could be easier. We could say that we're doing
> batch
> > > ML
> > > >>>> with a streaming API. That's interesting in its own. IMHO this
> > > >>>> integration
> > > >>>> is also a more approachable way towards end-to-end ML.
> > > >>>>
> > > >>>>
> > > >>>> Thanks for reading so far :)
> > > >>>>
> > > >>>> [1] https://github.com/apache/flink/pull/2819
> > > >>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
> > > >>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
> > > >>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
> > > >>>> 13-final77.pdf
> > > >>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
> > > >>>> Scoped+Loops+and+Job+Termination
> > > >>>> [6] https://github.com/apache/flink/pull/1668
> > > >>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.
> pdf
> > > >>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
> > > >>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > > >>>> com/Using-QueryableState-inside-Flink-jobs-and-
> > > >>>> Parameter-Server-implementation-td15880.html
> > > >>>>
> > > >>>> Cheers,
> > > >>>> Gabor
> > > >>>>
> > > >>>>
> > >
> > > --
> >
> > *Yours faithfully, *
> >
> > *Kate Eri.*
> >
>
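
For the purely online variant discussed in the quoted messages (online
linear regression with SGD over connected streams, as in the user-list
thread [1]), a minimal sketch could look like the following. This is an
illustration only: LabeledPoint is a placeholder type, the weights live in a
plain field (so they are neither checkpointed nor shared across parallel
instances), and the learning-rate handling is deliberately naive.

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector

case class LabeledPoint(x: Array[Double], y: Double)

class OnlineSgd(dim: Int, lr: Double)
    extends CoFlatMapFunction[LabeledPoint, Array[Double], Double] {

  private val w = Array.fill(dim)(0.0)

  // Training stream: one SGD step per labeled example.
  override def flatMap1(p: LabeledPoint, out: Collector[Double]): Unit = {
    val err = p.y - w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
    for (i <- w.indices) w(i) += lr * err * p.x(i)
  }

  // Query stream: score with the current weights.
  override def flatMap2(x: Array[Double], out: Collector[Double]): Unit =
    out.collect(w.zip(x).map { case (wi, xi) => wi * xi }.sum)
}

// Wiring it up (parallelism 1, or broadcast the training stream):
//   labeled.connect(queries).flatMap(new OnlineSgd(dim = 2, lr = 0.01))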

Re: [DISCUSS] Flink ML roadmap

Posted by Stavros Kontopoulos <st...@gmail.com>.
+100 for a design doc.

Could we also set a roadmap after some time-boxed investigation captured in
that document? We need action.

Looking forward to working on this (whatever that might be) ;) Also, is there
any data supporting one direction or the other from a customer perspective?
It would help us make more informed decisions.

On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <ka...@gmail.com>
wrote:

> Yes, ok.
> let's start some design document, and write down there already mentioned
> ideas about: parameter server, about clipper and others. Would be nice if
> we will also map this approaches to cases.
> Will work on it collaboratively on each topic, may be finally we will form
> some picture, that could be agreed with committers.
> @Gabor, could you please start such shared doc, as you have already several
> ideas proposed?
>
> On Thu, Feb 23, 2017 at 15:06, Gábor Hermann <ma...@gaborhermann.com> wrote:
>
> > I agree, that it's better to go in one direction first, but I think
> > online and offline with streaming API can go somewhat parallel later. We
> > could set a short-term goal, concentrate initially on one direction, and
> > showcase that direction (e.g. in a blogpost). But first, we should list
> > the pros/cons in a design doc as a minimum. Then make a decision what
> > direction to go. Would that be feasible?
> >
> > On 2017-02-23 12:34, Katherin Eri wrote:
> >
> > > I'm not sure that this is feasible, doing all at the same time could
> mean
> > > doing nothing((((
> > > I'm just afraid, that words: we will work on streaming not on batching,
> > we
> > > have no commiter's time for this, mean that yes, we started work on
> > > FLINK-1730, but nobody will commit this work in the end, as it already
> > was
> > > with this ticket.
> > >
> > > On Feb 23, 2017 at 14:26, "Gábor Hermann" <
> > mail@gaborhermann.com>
> > > wrote:
> > >
> > >> @Theodore: Great to hear you think the "batch on streaming" approach
> is
> > >> possible! Of course, we need to pay attention all the pitfalls there,
> > if we
> > >> go that way.
> > >>
> > >> +1 for a design doc!
> > >>
> > >> I would add that it's possible to make efforts in all the three
> > directions
> > >> (i.e. batch, online, batch on streaming) at the same time. Although,
> it
> > >> might be worth to concentrate on one. E.g. it would not be so useful
> to
> > >> have the same batch algorithms with both the batch API and streaming
> > API.
> > >> We can decide later.
> > >>
> > >> The design doc could be partitioned to these 3 directions, and we can
> > >> collect there the pros/cons too. What do you think?
> > >>
> > >> Cheers,
> > >> Gabor
> > >>
> > >>
> > >> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
> > >>
> > >>> Hello all,
> > >>>
> > >>>
> > >>> @Gabor, we have discussed the idea of using the streaming API to
> write
> > all
> > >>> of our ML algorithms with a couple of people offline,
> > >>> and I think it might be possible and is generally worth a shot. The
> > >>> approach we would take would be close to Vowpal Wabbit, not exactly
> > >>> "online", but rather "fast-batch".
> > >>>
> > >>> There will be problems popping up again, even for very simple algos
> > like
> > >>> on
> > >>> line linear regression with SGD [1], but hopefully fixing those will
> be
> > >>> more aligned with the priorities of the community.
> > >>>
> > >>> @Katherin, my understanding is that given the limited resources,
> there
> > is
> > >>> no development effort focused on batch processing right now.
> > >>>
> > >>> So to summarize, it seems like there are people willing to work on ML
> > on
> > >>> Flink, but nobody is sure how to do it.
> > >>> There are many directions we could take (batch, online, batch on
> > >>> streaming), each with its own merits and downsides.
> > >>>
> > >>> If you want we can start a design doc and move the conversation
> there,
> > >>> come
> > >>> up with a roadmap and start implementing.
> > >>>
> > >>> Regards,
> > >>> Theodore
> > >>>
> > >>> [1]
> > >>> http://apache-flink-user-mailing-list-archive.2336050.n4.
> > >>> nabble.com/Understanding-connected-streams-use-without-times
> > >>> tamps-td10241.html
> > >>>
> > >>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
> mail@gaborhermann.com
> > >
> > >>> wrote:
> > >>>
> > >>> It's great to see so much activity in this discussion :)
> > >>>> I'll try to add my thoughts.
> > >>>>
> > >>>> I think building a developer community (Till's 2. point) can be
> > slightly
> > >>>> separated from what features we should aim for (1. point) and
> > showcasing
> > >>>> (3. point). Thanks Till for bringing up the ideas for restructuring,
> > I'm
> > >>>> sure we'll find a way to make the development process more dynamic.
> > I'll
> > >>>> try to address the rest here.
> > >>>>
> > >>>> It's hard to choose directions between streaming and batch ML. As
> Theo
> > >>>> has
> > >>>> indicated, not much online ML is used in production, but Flink
> > >>>> concentrates
> > >>>> on streaming, so online ML would be a better fit for Flink. However,
> > as
> > >>>> most of you argued, there's definite need for batch ML. But batch ML
> > >>>> seems
> > >>>> hard to achieve because there are blocking issues with persisting,
> > >>>> iteration paths etc. So it's no good either way.
> > >>>>
> > >>>> I propose a seemingly crazy solution: what if we developed batch
> > >>>> algorithms also with the streaming API? The batch API would clearly
> > seem
> > >>>> more suitable for ML algorithms, but there a lot of benefits of this
> > >>>> approach too, so it's clearly worth considering. Flink also has the
> > high
> > >>>> level vision of "streaming for everything" that would clearly fit
> this
> > >>>> case. What do you all think about this? Do you think this solution
> > would
> > >>>> be
> > >>>> feasible? I would be happy to make a more elaborate proposal, but I
> > push
> > >>>> my
> > >>>> main ideas here:
> > >>>>
> > >>>> 1) Simplifying by using one system
> > >>>> It could simplify the work of both the users and the developers. One
> > >>>> could
> > >>>> execute training once, or could execute it periodically e.g. by
> using
> > >>>> windows. Low-latency serving and training could be done in the same
> > >>>> system.
> > >>>> We could implement incremental algorithms, without any side inputs
> for
> > >>>> combining online learning (or predictions) with batch learning. Of
> > >>>> course,
> > >>>> all the logic describing these must be somehow implemented (e.g.
> > >>>> synchronizing predictions with training), but it should be easier to
> > do
> > >>>> so
> > >>>> in one system, than by combining e.g. the batch and streaming API.
> > >>>>
> > >>>> 2) Batch ML with the streaming API is not harder
> > >>>> Despite these benefits, it could seem harder to implement batch ML
> > with
> > >>>> the streaming API, but in my opinion it's not. There are more
> > flexible,
> > >>>> lower-level optimization potentials with the streaming API. Most
> > >>>> distributed ML algorithms use a lower-level model than the batch API
> > >>>> anyway, so sometimes it feels like forcing the algorithm logic into
> > the
> > >>>> training API and tweaking it. Although we could not use the batch
> > >>>> primitives like join, we would have the E.g. in my experience with
> > >>>> implementing a distributed matrix factorization algorithm [1], I
> > couldn't
> > >>>> do a simple optimization because of the limitations of the iteration
> > API
> > >>>> [2]. Even if we pushed all the development effort to make the batch
> > API
> > >>>> more suitable for ML there would be things we couldn't do. E.g.
> there
> > are
> > >>>> approaches for updating a model iteratively without locks [3,4]
> (i.e.
> > >>>> somewhat asynchronously), and I don't see a clear way to implement
> > such
> > >>>> algorithms with the batch API.
> > >>>>
> > >>>> 3) Streaming community (users and devs) benefit
> > >>>> The Flink streaming community in general would also benefit from
> this
> > >>>> direction. There are many features needed in the streaming API for
> ML
> > to
> > >>>> work, but this is also true for the batch API. One really important
> is
> > >>>> the
> > >>>> loops API (a.k.a. iterative DataStreams) [5]. There has been a lot
> of
> > >>>> effort (mostly from Paris) for making it mature enough [6]. Kate
> > >>>> mentioned
> > >>>> using GPUs, and I'm sure they have uses in streaming generally [7].
> > Thus,
> > >>>> by improving the streaming API to allow ML algorithms, the streaming
> > API
> > >>>> benefit too (which is important as they have a lot more production
> > users
> > >>>> than the batch API).
> > >>>>
> > >>>> 4) Performance can be at least as good
> > >>>> I believe the same performance could be achieved with the streaming
> > API
> > >>>> as
> > >>>> with the batch API. Streaming API is much closer to the runtime than
> > the
> > >>>> batch API. For corner-cases, with runtime-layer optimizations of
> batch
> > >>>> API,
> > >>>> we could find a way to do the same (or similar) optimization for the
> > >>>> streaming API (see my previous point). Such case could be using
> > managed
> > >>>> memory (and spilling to disk). There are also benefits by default,
> > e.g.
> > >>>> we
> > >>>> would have a finer grained fault tolerance with the streaming API.
> > >>>>
> > >>>> 5) We could keep batch ML API
> > >>>> For the shorter term, we should not throw away all the algorithms
> > >>>> implemented with the batch API. By pushing forward the development
> > with
> > >>>> side inputs we could make them usable with streaming API. Then, if
> the
> > >>>> library gains some popularity, we could replace the algorithms in
> the
> > >>>> batch
> > >>>> API with streaming ones, to avoid the performance costs of e.g. not
> > being
> > >>>> able to persist.
> > >>>>
> > >>>> 6) General tools for implementing ML algorithms
> > >>>> Besides implementing algorithms one by one, we could give more
> general
> > >>>> tools for making it easier to implement algorithms. E.g. parameter
> > server
> > >>>> [8,9]. Theo also mentioned in another thread that TensorFlow has a
> > >>>> similar
> > >>>> model to Flink streaming, we could look into that too. I think often
> > when
> > >>>> deploying a production ML system, much more configuration and
> tweaking
> > >>>> should be done than e.g. Spark MLlib allows. Why not allow that?
> > >>>>
> > >>>> 7) Showcasing
> > >>>> Showcasing this could be easier. We could say that we're doing batch
> > ML
> > >>>> with a streaming API. That's interesting in its own. IMHO this
> > >>>> integration
> > >>>> is also a more approachable way towards end-to-end ML.
> > >>>>
> > >>>>
> > >>>> Thanks for reading so far :)
> > >>>>
> > >>>> [1] https://github.com/apache/flink/pull/2819
> > >>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
> > >>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
> > >>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
> > >>>> 13-final77.pdf
> > >>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
> > >>>> Scoped+Loops+and+Job+Termination
> > >>>> [6] https://github.com/apache/flink/pull/1668
> > >>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
> > >>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
> > >>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > >>>> com/Using-QueryableState-inside-Flink-jobs-and-
> > >>>> Parameter-Server-implementation-td15880.html
> > >>>>
> > >>>> Cheers,
> > >>>> Gabor
> > >>>>
> > >>>>
> >
> > --
>
> *Yours faithfully, *
>
> *Kate Eri.*
>
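
The loops API (iterative DataStreams, reference [5] in the quoted proposal)
that the streaming direction leans on already exists in a basic form. Below
is a minimal sketch in the spirit of the documented example: decremented
values are fed back into the loop until they reach zero. In an ML job, model
updates would travel along the feedback edge in the same way.

import org.apache.flink.streaming.api.scala._

object IterateSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val someIntegers: DataStream[Long] = env.generateSequence(0, 1000)

    val iterated: DataStream[Long] = someIntegers.iterate(
      iteration => {
        val minusOne = iteration.map(_ - 1)
        val feedback = minusOne.filter(_ > 0)  // fed back into the loop
        val output = minusOne.filter(_ <= 0)   // leaves the loop
        (feedback, output)
      }
    )

    iterated.print()
    env.execute("streaming iteration sketch")
  }
}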

Re: [DISCUSS] Flink ML roadmap

Posted by Katherin Eri <ka...@gmail.com>.
Yes, ok.
Let's start a design document and write down the ideas already mentioned
there: the parameter server, Clipper, and others. It would be nice if we
also mapped these approaches to concrete use cases.
We will work on it collaboratively, topic by topic; maybe we will finally
form a picture that the committers can agree on.
@Gabor, could you please start such a shared doc, as you have already
proposed several ideas?

On Thu, Feb 23, 2017 at 15:06, Gábor Hermann <ma...@gaborhermann.com> wrote:

> I agree, that it's better to go in one direction first, but I think
> online and offline with streaming API can go somewhat parallel later. We
> could set a short-term goal, concentrate initially on one direction, and
> showcase that direction (e.g. in a blogpost). But first, we should list
> the pros/cons in a design doc as a minimum. Then make a decision what
> direction to go. Would that be feasible?
>
> On 2017-02-23 12:34, Katherin Eri wrote:
>
> > I'm not sure that this is feasible, doing all at the same time could mean
> > doing nothing((((
> > I'm just afraid, that words: we will work on streaming not on batching,
> we
> > have no commiter's time for this, mean that yes, we started work on
> > FLINK-1730, but nobody will commit this work in the end, as it already
> was
> > with this ticket.
> >
> > On Feb 23, 2017 at 14:26, "Gábor Hermann" <
> mail@gaborhermann.com>
> > wrote:
> >
> >> @Theodore: Great to hear you think the "batch on streaming" approach is
> >> possible! Of course, we need to pay attention all the pitfalls there,
> if we
> >> go that way.
> >>
> >> +1 for a design doc!
> >>
> >> I would add that it's possible to make efforts in all the three
> directions
> >> (i.e. batch, online, batch on streaming) at the same time. Although, it
> >> might be worth to concentrate on one. E.g. it would not be so useful to
> >> have the same batch algorithms with both the batch API and streaming
> API.
> >> We can decide later.
> >>
> >> The design doc could be partitioned to these 3 directions, and we can
> >> collect there the pros/cons too. What do you think?
> >>
> >> Cheers,
> >> Gabor
> >>
> >>
> >> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
> >>
> >>> Hello all,
> >>>
> >>>
> >>> @Gabor, we have discussed the idea of using the streaming API to write
> all
> >>> of our ML algorithms with a couple of people offline,
> >>> and I think it might be possible and is generally worth a shot. The
> >>> approach we would take would be close to Vowpal Wabbit, not exactly
> >>> "online", but rather "fast-batch".
> >>>
> >>> There will be problems popping up again, even for very simple algos
> like
> >>> on
> >>> line linear regression with SGD [1], but hopefully fixing those will be
> >>> more aligned with the priorities of the community.
> >>>
> >>> @Katherin, my understanding is that given the limited resources, there
> is
> >>> no development effort focused on batch processing right now.
> >>>
> >>> So to summarize, it seems like there are people willing to work on ML
> on
> >>> Flink, but nobody is sure how to do it.
> >>> There are many directions we could take (batch, online, batch on
> >>> streaming), each with its own merits and downsides.
> >>>
> >>> If you want we can start a design doc and move the conversation there,
> >>> come
> >>> up with a roadmap and start implementing.
> >>>
> >>> Regards,
> >>> Theodore
> >>>
> >>> [1]
> >>> http://apache-flink-user-mailing-list-archive.2336050.n4.
> >>> nabble.com/Understanding-connected-streams-use-without-times
> >>> tamps-td10241.html
> >>>
> >>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <mail@gaborhermann.com
> >
> >>> wrote:
> >>>
> >>> It's great to see so much activity in this discussion :)
> >>>> I'll try to add my thoughts.
> >>>>
> >>>> I think building a developer community (Till's 2. point) can be
> slightly
> >>>> separated from what features we should aim for (1. point) and
> showcasing
> >>>> (3. point). Thanks Till for bringing up the ideas for restructuring,
> I'm
> >>>> sure we'll find a way to make the development process more dynamic.
> I'll
> >>>> try to address the rest here.
> >>>>
> >>>> It's hard to choose directions between streaming and batch ML. As Theo
> >>>> has
> >>>> indicated, not much online ML is used in production, but Flink
> >>>> concentrates
> >>>> on streaming, so online ML would be a better fit for Flink. However,
> as
> >>>> most of you argued, there's definite need for batch ML. But batch ML
> >>>> seems
> >>>> hard to achieve because there are blocking issues with persisting,
> >>>> iteration paths etc. So it's no good either way.
> >>>>
> >>>> I propose a seemingly crazy solution: what if we developed batch
> >>>> algorithms also with the streaming API? The batch API would clearly
> seem
> >>>> more suitable for ML algorithms, but there a lot of benefits of this
> >>>> approach too, so it's clearly worth considering. Flink also has the
> high
> >>>> level vision of "streaming for everything" that would clearly fit this
> >>>> case. What do you all think about this? Do you think this solution
> would
> >>>> be
> >>>> feasible? I would be happy to make a more elaborate proposal, but I
> push
> >>>> my
> >>>> main ideas here:
> >>>>
> >>>> 1) Simplifying by using one system
> >>>> It could simplify the work of both the users and the developers. One
> >>>> could
> >>>> execute training once, or could execute it periodically e.g. by using
> >>>> windows. Low-latency serving and training could be done in the same
> >>>> system.
> >>>> We could implement incremental algorithms, without any side inputs for
> >>>> combining online learning (or predictions) with batch learning. Of
> >>>> course,
> >>>> all the logic describing these must be somehow implemented (e.g.
> >>>> synchronizing predictions with training), but it should be easier to
> do
> >>>> so
> >>>> in one system, than by combining e.g. the batch and streaming API.
> >>>>
> >>>> 2) Batch ML with the streaming API is not harder
> >>>> Despite these benefits, it could seem harder to implement batch ML
> with
> >>>> the streaming API, but in my opinion it's not. There are more
> flexible,
> >>>> lower-level optimization potentials with the streaming API. Most
> >>>> distributed ML algorithms use a lower-level model than the batch API
> >>>> anyway, so sometimes it feels like forcing the algorithm logic into
> the
> >>>> training API and tweaking it. Although we could not use the batch
> >>>> primitives like join, we would have the E.g. in my experience with
> >>>> implementing a distributed matrix factorization algorithm [1], I
> couldn't
> >>>> do a simple optimization because of the limitations of the iteration
> API
> >>>> [2]. Even if we pushed all the development effort to make the batch
> API
> >>>> more suitable for ML there would be things we couldn't do. E.g. there
> are
> >>>> approaches for updating a model iteratively without locks [3,4] (i.e.
> >>>> somewhat asynchronously), and I don't see a clear way to implement
> such
> >>>> algorithms with the batch API.
> >>>>
> >>>> 3) Streaming community (users and devs) benefit
> >>>> The Flink streaming community in general would also benefit from this
> >>>> direction. There are many features needed in the streaming API for ML
> to
> >>>> work, but this is also true for the batch API. One really important is
> >>>> the
> >>>> loops API (a.k.a. iterative DataStreams) [5]. There has been a lot of
> >>>> effort (mostly from Paris) for making it mature enough [6]. Kate
> >>>> mentioned
> >>>> using GPUs, and I'm sure they have uses in streaming generally [7].
> Thus,
> >>>> by improving the streaming API to allow ML algorithms, the streaming
> API
> >>>> benefit too (which is important as they have a lot more production
> users
> >>>> than the batch API).
> >>>>
> >>>> 4) Performance can be at least as good
> >>>> I believe the same performance could be achieved with the streaming
> API
> >>>> as
> >>>> with the batch API. Streaming API is much closer to the runtime than
> the
> >>>> batch API. For corner-cases, with runtime-layer optimizations of batch
> >>>> API,
> >>>> we could find a way to do the same (or similar) optimization for the
> >>>> streaming API (see my previous point). Such case could be using
> managed
> >>>> memory (and spilling to disk). There are also benefits by default,
> e.g.
> >>>> we
> >>>> would have a finer grained fault tolerance with the streaming API.
> >>>>
> >>>> 5) We could keep batch ML API
> >>>> For the shorter term, we should not throw away all the algorithms
> >>>> implemented with the batch API. By pushing forward the development
> with
> >>>> side inputs we could make them usable with streaming API. Then, if the
> >>>> library gains some popularity, we could replace the algorithms in the
> >>>> batch
> >>>> API with streaming ones, to avoid the performance costs of e.g. not
> being
> >>>> able to persist.
> >>>>
> >>>> 6) General tools for implementing ML algorithms
> >>>> Besides implementing algorithms one by one, we could give more general
> >>>> tools for making it easier to implement algorithms. E.g. parameter
> server
> >>>> [8,9]. Theo also mentioned in another thread that TensorFlow has a
> >>>> similar
> >>>> model to Flink streaming, we could look into that too. I think often
> when
> >>>> deploying a production ML system, much more configuration and tweaking
> >>>> should be done than e.g. Spark MLlib allows. Why not allow that?
> >>>>
> >>>> 7) Showcasing
> >>>> Showcasing this could be easier. We could say that we're doing batch
> ML
> >>>> with a streaming API. That's interesting in its own. IMHO this
> >>>> integration
> >>>> is also a more approachable way towards end-to-end ML.
> >>>>
> >>>>
> >>>> Thanks for reading so far :)
> >>>>
> >>>> [1] https://github.com/apache/flink/pull/2819
> >>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
> >>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
> >>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
> >>>> 13-final77.pdf
> >>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
> >>>> Scoped+Loops+and+Job+Termination
> >>>> [6] https://github.com/apache/flink/pull/1668
> >>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
> >>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
> >>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> >>>> com/Using-QueryableState-inside-Flink-jobs-and-
> >>>> Parameter-Server-implementation-td15880.html
> >>>>
> >>>> Cheers,
> >>>> Gabor
> >>>>
> >>>>
>
> --

*Yours faithfully, *

*Kate Eri.*
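
One way to picture the parameter-server idea from the quoted proposal is as
parameter blocks kept in Flink keyed state: workers send gradient updates
keyed by block id, the blocks are updated in place, and the new values are
emitted downstream. The sketch below is an illustration only; BlockUpdate,
ParameterBlock and ParameterServerBlock are made-up names rather than an
existing Flink or FlinkML API, and a real design would also have to route
the updated blocks back to the workers (e.g. via broadcast or iteration).

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

case class BlockUpdate(blockId: Int, gradient: Array[Double], lr: Double)
case class ParameterBlock(blockId: Int, weights: Array[Double])

class ParameterServerBlock(dim: Int)
    extends RichFlatMapFunction[BlockUpdate, ParameterBlock] {

  private var block: ValueState[Array[Double]] = _

  override def open(parameters: Configuration): Unit =
    block = getRuntimeContext.getState(
      new ValueStateDescriptor[Array[Double]]("weights", classOf[Array[Double]]))

  override def flatMap(u: BlockUpdate, out: Collector[ParameterBlock]): Unit = {
    val w = Option(block.value()).getOrElse(Array.fill(dim)(0.0))
    for (i <- w.indices) w(i) -= u.lr * u.gradient(i)  // apply the gradient step
    block.update(w)
    out.collect(ParameterBlock(u.blockId, w))          // publish the new block
  }
}

// Wiring it up: updates.keyBy(_.blockId).flatMap(new ParameterServerBlock(dim = 1000))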

Re: [DISCUSS] Flink ML roadmap

Posted by Gábor Hermann <ma...@gaborhermann.com>.
I agree that it's better to go in one direction first, but I think online
and offline learning with the streaming API can proceed somewhat in parallel
later. We could set a short-term goal, concentrate initially on one
direction, and showcase that direction (e.g. in a blog post). But first, as
a minimum, we should list the pros/cons in a design doc. Then make a
decision on which direction to go. Would that be feasible?

On 2017-02-23 12:34, Katherin Eri wrote:

> I'm not sure that this is feasible, doing all at the same time could mean
> doing nothing((((
> I'm just afraid, that words: we will work on streaming not on batching, we
> have no commiter's time for this, mean that yes, we started work on
> FLINK-1730, but nobody will commit this work in the end, as it already was
> with this ticket.
>
> On Feb 23, 2017 at 14:26, "Gábor Hermann" <ma...@gaborhermann.com>
> wrote:
>
>> @Theodore: Great to hear you think the "batch on streaming" approach is
>> possible! Of course, we need to pay attention all the pitfalls there, if we
>> go that way.
>>
>> +1 for a design doc!
>>
>> I would add that it's possible to make efforts in all the three directions
>> (i.e. batch, online, batch on streaming) at the same time. Although, it
>> might be worth to concentrate on one. E.g. it would not be so useful to
>> have the same batch algorithms with both the batch API and streaming API.
>> We can decide later.
>>
>> The design doc could be partitioned to these 3 directions, and we can
>> collect there the pros/cons too. What do you think?
>>
>> Cheers,
>> Gabor
>>
>>
>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>
>>> Hello all,
>>>
>>>
>>> @Gabor, we have discussed the idea of using the streaming API to write all
>>> of our ML algorithms with a couple of people offline,
>>> and I think it might be possible and is generally worth a shot. The
>>> approach we would take would be close to Vowpal Wabbit, not exactly
>>> "online", but rather "fast-batch".
>>>
>>> There will be problems popping up again, even for very simple algos like
>>> on
>>> line linear regression with SGD [1], but hopefully fixing those will be
>>> more aligned with the priorities of the community.
>>>
>>> @Katherin, my understanding is that given the limited resources, there is
>>> no development effort focused on batch processing right now.
>>>
>>> So to summarize, it seems like there are people willing to work on ML on
>>> Flink, but nobody is sure how to do it.
>>> There are many directions we could take (batch, online, batch on
>>> streaming), each with its own merits and downsides.
>>>
>>> If you want we can start a design doc and move the conversation there,
>>> come
>>> up with a roadmap and start implementing.
>>>
>>> Regards,
>>> Theodore
>>>
>>> [1]
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.
>>> nabble.com/Understanding-connected-streams-use-without-times
>>> tamps-td10241.html
>>>
>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <ma...@gaborhermann.com>
>>> wrote:
>>>
>>> It's great to see so much activity in this discussion :)
>>>> I'll try to add my thoughts.
>>>>
>>>> I think building a developer community (Till's 2. point) can be slightly
>>>> separated from what features we should aim for (1. point) and showcasing
>>>> (3. point). Thanks Till for bringing up the ideas for restructuring, I'm
>>>> sure we'll find a way to make the development process more dynamic. I'll
>>>> try to address the rest here.
>>>>
>>>> It's hard to choose directions between streaming and batch ML. As Theo
>>>> has
>>>> indicated, not much online ML is used in production, but Flink
>>>> concentrates
>>>> on streaming, so online ML would be a better fit for Flink. However, as
>>>> most of you argued, there's definite need for batch ML. But batch ML
>>>> seems
>>>> hard to achieve because there are blocking issues with persisting,
>>>> iteration paths etc. So it's no good either way.
>>>>
>>>> I propose a seemingly crazy solution: what if we developed batch
>>>> algorithms also with the streaming API? The batch API would clearly seem
>>>> more suitable for ML algorithms, but there a lot of benefits of this
>>>> approach too, so it's clearly worth considering. Flink also has the high
>>>> level vision of "streaming for everything" that would clearly fit this
>>>> case. What do you all think about this? Do you think this solution would
>>>> be
>>>> feasible? I would be happy to make a more elaborate proposal, but I push
>>>> my
>>>> main ideas here:
>>>>
>>>> 1) Simplifying by using one system
>>>> It could simplify the work of both the users and the developers. One
>>>> could
>>>> execute training once, or could execute it periodically e.g. by using
>>>> windows. Low-latency serving and training could be done in the same
>>>> system.
>>>> We could implement incremental algorithms, without any side inputs for
>>>> combining online learning (or predictions) with batch learning. Of
>>>> course,
>>>> all the logic describing these must be somehow implemented (e.g.
>>>> synchronizing predictions with training), but it should be easier to do
>>>> so
>>>> in one system, than by combining e.g. the batch and streaming API.
>>>>
>>>> 2) Batch ML with the streaming API is not harder
>>>> Despite these benefits, it could seem harder to implement batch ML with
>>>> the streaming API, but in my opinion it's not. There are more flexible,
>>>> lower-level optimization potentials with the streaming API. Most
>>>> distributed ML algorithms use a lower-level model than the batch API
>>>> anyway, so sometimes it feels like forcing the algorithm logic into the
>>>> training API and tweaking it. Although we could not use the batch
>>>> primitives like join, we would have the E.g. in my experience with
>>>> implementing a distributed matrix factorization algorithm [1], I couldn't
>>>> do a simple optimization because of the limitations of the iteration API
>>>> [2]. Even if we pushed all the development effort to make the batch API
>>>> more suitable for ML there would be things we couldn't do. E.g. there are
>>>> approaches for updating a model iteratively without locks [3,4] (i.e.
>>>> somewhat asynchronously), and I don't see a clear way to implement such
>>>> algorithms with the batch API.
>>>>
>>>> 3) Streaming community (users and devs) benefit
>>>> The Flink streaming community in general would also benefit from this
>>>> direction. There are many features needed in the streaming API for ML to
>>>> work, but this is also true for the batch API. One really important is
>>>> the
>>>> loops API (a.k.a. iterative DataStreams) [5]. There has been a lot of
>>>> effort (mostly from Paris) for making it mature enough [6]. Kate
>>>> mentioned
>>>> using GPUs, and I'm sure they have uses in streaming generally [7]. Thus,
>>>> by improving the streaming API to allow ML algorithms, the streaming API
>>>> benefit too (which is important as they have a lot more production users
>>>> than the batch API).
>>>>
>>>> 4) Performance can be at least as good
>>>> I believe the same performance could be achieved with the streaming API
>>>> as
>>>> with the batch API. Streaming API is much closer to the runtime than the
>>>> batch API. For corner-cases, with runtime-layer optimizations of batch
>>>> API,
>>>> we could find a way to do the same (or similar) optimization for the
>>>> streaming API (see my previous point). Such case could be using managed
>>>> memory (and spilling to disk). There are also benefits by default, e.g.
>>>> we
>>>> would have a finer grained fault tolerance with the streaming API.
>>>>
>>>> 5) We could keep batch ML API
>>>> For the shorter term, we should not throw away all the algorithms
>>>> implemented with the batch API. By pushing forward the development with
>>>> side inputs we could make them usable with streaming API. Then, if the
>>>> library gains some popularity, we could replace the algorithms in the
>>>> batch
>>>> API with streaming ones, to avoid the performance costs of e.g. not being
>>>> able to persist.
>>>>
>>>> 6) General tools for implementing ML algorithms
>>>> Besides implementing algorithms one by one, we could give more general
>>>> tools for making it easier to implement algorithms. E.g. parameter server
>>>> [8,9]. Theo also mentioned in another thread that TensorFlow has a
>>>> similar
>>>> model to Flink streaming, we could look into that too. I think often when
>>>> deploying a production ML system, much more configuration and tweaking
>>>> should be done than e.g. Spark MLlib allows. Why not allow that?
>>>>
>>>> 7) Showcasing
>>>> Showcasing this could be easier. We could say that we're doing batch ML
>>>> with a streaming API. That's interesting in its own. IMHO this
>>>> integration
>>>> is also a more approachable way towards end-to-end ML.
>>>>
>>>>
>>>> Thanks for reading so far :)
>>>>
>>>> [1] https://github.com/apache/flink/pull/2819
>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
>>>> 13-final77.pdf
>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
>>>> Scoped+Loops+and+Job+Termination
>>>> [6] https://github.com/apache/flink/pull/1668
>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
>>>> com/Using-QueryableState-inside-Flink-jobs-and-
>>>> Parameter-Server-implementation-td15880.html
>>>>
>>>> Cheers,
>>>> Gabor
>>>>
>>>>


Re: [DISCUSS] Flink ML roadmap

Posted by Katherin Eri <ka...@gmail.com>.
I'm not sure that this is feasible; doing everything at the same time could
mean doing nothing((((
I'm just afraid that the words "we will work on streaming, not on batching;
we have no committer time for this" mean that yes, we started work on
FLINK-1730, but nobody will commit this work in the end, as has already
happened with this ticket.

On Feb 23, 2017 at 14:26, "Gábor Hermann" <ma...@gaborhermann.com>
wrote:

> @Theodore: Great to hear you think the "batch on streaming" approach is
> possible! Of course, we need to pay attention all the pitfalls there, if we
> go that way.
>
> +1 for a design doc!
>
> I would add that it's possible to make efforts in all the three directions
> (i.e. batch, online, batch on streaming) at the same time. Although, it
> might be worth to concentrate on one. E.g. it would not be so useful to
> have the same batch algorithms with both the batch API and streaming API.
> We can decide later.
>
> The design doc could be partitioned to these 3 directions, and we can
> collect there the pros/cons too. What do you think?
>
> Cheers,
> Gabor
>
>
> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>
>> Hello all,
>>
>>
>> @Gabor, we have discussed the idea of using the streaming API to write all
>> of our ML algorithms with a couple of people offline,
>> and I think it might be possible and is generally worth a shot. The
>> approach we would take would be close to Vowpal Wabbit, not exactly
>> "online", but rather "fast-batch".
>>
>> There will be problems popping up again, even for very simple algos like
>> on
>> line linear regression with SGD [1], but hopefully fixing those will be
>> more aligned with the priorities of the community.
>>
>> @Katherin, my understanding is that given the limited resources, there is
>> no development effort focused on batch processing right now.
>>
>> So to summarize, it seems like there are people willing to work on ML on
>> Flink, but nobody is sure how to do it.
>> There are many directions we could take (batch, online, batch on
>> streaming), each with its own merits and downsides.
>>
>> If you want we can start a design doc and move the conversation there,
>> come
>> up with a roadmap and start implementing.
>>
>> Regards,
>> Theodore
>>
>> [1]
>> http://apache-flink-user-mailing-list-archive.2336050.n4.
>> nabble.com/Understanding-connected-streams-use-without-times
>> tamps-td10241.html
>>
>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <ma...@gaborhermann.com>
>> wrote:
>>
>> It's great to see so much activity in this discussion :)
>>> I'll try to add my thoughts.
>>>
>>> I think building a developer community (Till's 2. point) can be slightly
>>> separated from what features we should aim for (1. point) and showcasing
>>> (3. point). Thanks Till for bringing up the ideas for restructuring, I'm
>>> sure we'll find a way to make the development process more dynamic. I'll
>>> try to address the rest here.
>>>
>>> It's hard to choose directions between streaming and batch ML. As Theo
>>> has
>>> indicated, not much online ML is used in production, but Flink
>>> concentrates
>>> on streaming, so online ML would be a better fit for Flink. However, as
>>> most of you argued, there's definite need for batch ML. But batch ML
>>> seems
>>> hard to achieve because there are blocking issues with persisting,
>>> iteration paths etc. So it's no good either way.
>>>
>>> I propose a seemingly crazy solution: what if we developed batch
>>> algorithms also with the streaming API? The batch API would clearly seem
>>> more suitable for ML algorithms, but there a lot of benefits of this
>>> approach too, so it's clearly worth considering. Flink also has the high
>>> level vision of "streaming for everything" that would clearly fit this
>>> case. What do you all think about this? Do you think this solution would
>>> be
>>> feasible? I would be happy to make a more elaborate proposal, but I push
>>> my
>>> main ideas here:
>>>
>>> 1) Simplifying by using one system
>>> It could simplify the work of both the users and the developers. One
>>> could
>>> execute training once, or could execute it periodically e.g. by using
>>> windows. Low-latency serving and training could be done in the same
>>> system.
>>> We could implement incremental algorithms, without any side inputs for
>>> combining online learning (or predictions) with batch learning. Of
>>> course,
>>> all the logic describing these must be somehow implemented (e.g.
>>> synchronizing predictions with training), but it should be easier to do
>>> so
>>> in one system, than by combining e.g. the batch and streaming API.
>>>
>>> 2) Batch ML with the streaming API is not harder
>>> Despite these benefits, it could seem harder to implement batch ML with
>>> the streaming API, but in my opinion it's not. There are more flexible,
>>> lower-level optimization potentials with the streaming API. Most
>>> distributed ML algorithms use a lower-level model than the batch API
>>> anyway, so sometimes it feels like forcing the algorithm logic into the
>>> training API and tweaking it. Although we could not use the batch
>>> primitives like join, we would have the E.g. in my experience with
>>> implementing a distributed matrix factorization algorithm [1], I couldn't
>>> do a simple optimization because of the limitations of the iteration API
>>> [2]. Even if we pushed all the development effort to make the batch API
>>> more suitable for ML there would be things we couldn't do. E.g. there are
>>> approaches for updating a model iteratively without locks [3,4] (i.e.
>>> somewhat asynchronously), and I don't see a clear way to implement such
>>> algorithms with the batch API.
>>>
>>> 3) Streaming community (users and devs) benefit
>>> The Flink streaming community in general would also benefit from this
>>> direction. There are many features needed in the streaming API for ML to
>>> work, but this is also true for the batch API. One really important is
>>> the
>>> loops API (a.k.a. iterative DataStreams) [5]. There has been a lot of
>>> effort (mostly from Paris) for making it mature enough [6]. Kate
>>> mentioned
>>> using GPUs, and I'm sure they have uses in streaming generally [7]. Thus,
>>> by improving the streaming API to allow ML algorithms, the streaming API
>>> benefit too (which is important as they have a lot more production users
>>> than the batch API).
>>>
>>> 4) Performance can be at least as good
>>> I believe the same performance could be achieved with the streaming API
>>> as
>>> with the batch API. Streaming API is much closer to the runtime than the
>>> batch API. For corner-cases, with runtime-layer optimizations of batch
>>> API,
>>> we could find a way to do the same (or similar) optimization for the
>>> streaming API (see my previous point). Such case could be using managed
>>> memory (and spilling to disk). There are also benefits by default, e.g.
>>> we
>>> would have a finer grained fault tolerance with the streaming API.
>>>
>>> 5) We could keep batch ML API
>>> For the shorter term, we should not throw away all the algorithms
>>> implemented with the batch API. By pushing forward the development with
>>> side inputs we could make them usable with streaming API. Then, if the
>>> library gains some popularity, we could replace the algorithms in the
>>> batch
>>> API with streaming ones, to avoid the performance costs of e.g. not being
>>> able to persist.
>>>
>>> 6) General tools for implementing ML algorithms
>>> Besides implementing algorithms one by one, we could give more general
>>> tools for making it easier to implement algorithms. E.g. parameter server
>>> [8,9]. Theo also mentioned in another thread that TensorFlow has a
>>> similar
>>> model to Flink streaming, we could look into that too. I think often when
>>> deploying a production ML system, much more configuration and tweaking
>>> should be done than e.g. Spark MLlib allows. Why not allow that?
>>>
>>> 7) Showcasing
>>> Showcasing this could be easier. We could say that we're doing batch ML
>>> with a streaming API. That's interesting in its own. IMHO this
>>> integration
>>> is also a more approachable way towards end-to-end ML.
>>>
>>>
>>> Thanks for reading so far :)
>>>
>>> [1] https://github.com/apache/flink/pull/2819
>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
>>> 13-final77.pdf
>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
>>> Scoped+Loops+and+Job+Termination
>>> [6] https://github.com/apache/flink/pull/1668
>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
>>> com/Using-QueryableState-inside-Flink-jobs-and-
>>> Parameter-Server-implementation-td15880.html
>>>
>>> Cheers,
>>> Gabor
>>>
>>>
>

Re: [DISCUSS] Flink ML roadmap

Posted by Gábor Hermann <ma...@gaborhermann.com>.
@Theodore: Great to hear you think the "batch on streaming" approach is
possible! Of course, we need to pay attention to all the pitfalls there,
if we go that way.

+1 for a design doc!

I would add that it's possible to make efforts in all three directions
(i.e. batch, online, batch on streaming) at the same time, although it
might be worth concentrating on one. E.g. it would not be so useful to
have the same batch algorithms in both the batch API and the streaming
API. We can decide later.

The design doc could be partitioned to these 3 directions, and we can 
collect there the pros/cons too. What do you think?

Cheers,
Gabor


On 2017-02-23 12:13, Theodore Vasiloudis wrote:
> Hello all,
>
>
> @Gabor, we have discussed the idea of using the streaming API to write all
> of our ML algorithms with a couple of people offline,
> and I think it might be possible and is generally worth a shot. The
> approach we would take would be close to Vowpal Wabbit, not exactly
> "online", but rather "fast-batch".
>
> There will be problems popping up again, even for very simple algos like on
> line linear regression with SGD [1], but hopefully fixing those will be
> more aligned with the priorities of the community.
>
> @Katherin, my understanding is that given the limited resources, there is
> no development effort focused on batch processing right now.
>
> So to summarize, it seems like there are people willing to work on ML on
> Flink, but nobody is sure how to do it.
> There are many directions we could take (batch, online, batch on
> streaming), each with its own merits and downsides.
>
> If you want we can start a design doc and move the conversation there, come
> up with a roadmap and start implementing.
>
> Regards,
> Theodore
>
> [1]
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
>
On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <ma...@gaborhermann.com>
> wrote:
>
>> It's great to see so much activity in this discussion :)
>> I'll try to add my thoughts.
>>
>> I think building a developer community (Till's 2. point) can be slightly
>> separated from what features we should aim for (1. point) and showcasing
>> (3. point). Thanks Till for bringing up the ideas for restructuring, I'm
>> sure we'll find a way to make the development process more dynamic. I'll
>> try to address the rest here.
>>
>> It's hard to choose directions between streaming and batch ML. As Theo has
>> indicated, not much online ML is used in production, but Flink concentrates
>> on streaming, so online ML would be a better fit for Flink. However, as
>> most of you argued, there's definite need for batch ML. But batch ML seems
>> hard to achieve because there are blocking issues with persisting,
>> iteration paths etc. So it's no good either way.
>>
>> I propose a seemingly crazy solution: what if we developed batch
>> algorithms also with the streaming API? The batch API would clearly seem
>> more suitable for ML algorithms, but there a lot of benefits of this
>> approach too, so it's clearly worth considering. Flink also has the high
>> level vision of "streaming for everything" that would clearly fit this
>> case. What do you all think about this? Do you think this solution would be
>> feasible? I would be happy to make a more elaborate proposal, but I push my
>> main ideas here:
>>
>> 1) Simplifying by using one system
>> It could simplify the work of both the users and the developers. One could
>> execute training once, or could execute it periodically e.g. by using
>> windows. Low-latency serving and training could be done in the same system.
>> We could implement incremental algorithms, without any side inputs for
>> combining online learning (or predictions) with batch learning. Of course,
>> all the logic describing these must be somehow implemented (e.g.
>> synchronizing predictions with training), but it should be easier to do so
>> in one system, than by combining e.g. the batch and streaming API.
>>
>> 2) Batch ML with the streaming API is not harder
>> Despite these benefits, it could seem harder to implement batch ML with
>> the streaming API, but in my opinion it's not. There are more flexible,
>> lower-level optimization potentials with the streaming API. Most
>> distributed ML algorithms use a lower-level model than the batch API
>> anyway, so sometimes it feels like forcing the algorithm logic into the
>> training API and tweaking it. Although we could not use the batch
>> primitives like join, we would have the E.g. in my experience with
>> implementing a distributed matrix factorization algorithm [1], I couldn't
>> do a simple optimization because of the limitations of the iteration API
>> [2]. Even if we pushed all the development effort to make the batch API
>> more suitable for ML there would be things we couldn't do. E.g. there are
>> approaches for updating a model iteratively without locks [3,4] (i.e.
>> somewhat asynchronously), and I don't see a clear way to implement such
>> algorithms with the batch API.
>>
>> 3) Streaming community (users and devs) benefit
>> The Flink streaming community in general would also benefit from this
>> direction. There are many features needed in the streaming API for ML to
>> work, but this is also true for the batch API. One really important is the
>> loops API (a.k.a. iterative DataStreams) [5]. There has been a lot of
>> effort (mostly from Paris) for making it mature enough [6]. Kate mentioned
>> using GPUs, and I'm sure they have uses in streaming generally [7]. Thus,
>> by improving the streaming API to allow ML algorithms, the streaming API
>> benefit too (which is important as they have a lot more production users
>> than the batch API).
>>
>> 4) Performance can be at least as good
>> I believe the same performance could be achieved with the streaming API as
>> with the batch API. Streaming API is much closer to the runtime than the
>> batch API. For corner-cases, with runtime-layer optimizations of batch API,
>> we could find a way to do the same (or similar) optimization for the
>> streaming API (see my previous point). Such case could be using managed
>> memory (and spilling to disk). There are also benefits by default, e.g. we
>> would have a finer grained fault tolerance with the streaming API.
>>
>> 5) We could keep batch ML API
>> For the shorter term, we should not throw away all the algorithms
>> implemented with the batch API. By pushing forward the development with
>> side inputs we could make them usable with streaming API. Then, if the
>> library gains some popularity, we could replace the algorithms in the batch
>> API with streaming ones, to avoid the performance costs of e.g. not being
>> able to persist.
>>
>> 6) General tools for implementing ML algorithms
>> Besides implementing algorithms one by one, we could give more general
>> tools for making it easier to implement algorithms. E.g. parameter server
>> [8,9]. Theo also mentioned in another thread that TensorFlow has a similar
>> model to Flink streaming, we could look into that too. I think often when
>> deploying a production ML system, much more configuration and tweaking
>> should be done than e.g. Spark MLlib allows. Why not allow that?
>>
>> 7) Showcasing
>> Showcasing this could be easier. We could say that we're doing batch ML
>> with a streaming API. That's interesting in its own. IMHO this integration
>> is also a more approachable way towards end-to-end ML.
>>
>>
>> Thanks for reading so far :)
>>
>> [1] https://github.com/apache/flink/pull/2819
>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
>> 13-final77.pdf
>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
>> Scoped+Loops+and+Job+Termination
>> [6] https://github.com/apache/flink/pull/1668
>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
>> com/Using-QueryableState-inside-Flink-jobs-and-
>> Parameter-Server-implementation-td15880.html
>>
>> Cheers,
>> Gabor
>>


Re: [DISCUSS] Flink ML roadmap

Posted by Theodore Vasiloudis <th...@gmail.com>.
Hello all,


@Gabor, we have discussed the idea of using the streaming API to write all
of our ML algorithms with a couple of people offline,
and I think it might be possible and is generally worth a shot. The
approach we would take would be close to Vowpal Wabbit, not exactly
"online", but rather "fast-batch".

There will be problems popping up again, even for very simple algos like
online linear regression with SGD [1], but hopefully fixing those will be
more aligned with the priorities of the community.
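
Just to make this concrete, here is what "online linear regression with
SGD" looks like in its most naive form, written directly against the
DataStream Scala API. The types and the source are made up for the sketch,
and the weight vector lives in a plain operator field, so it only makes
sense with parallelism 1; distributing the model properly is where the
real problems begin.

import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.streaming.api.scala._

// Hypothetical input type: a label and a feature vector.
case class Sample(label: Double, features: Vector[Double])

object OnlineSgdSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)  // the naive model below is not distributed

    // Assume an unbounded text source of "label,f1,f2" records (made up here).
    val samples: DataStream[Sample] = env
      .socketTextStream("localhost", 9999)
      .map { line =>
        val cols = line.split(",").map(_.toDouble)
        Sample(cols.head, cols.tail.toVector)
      }

    // Per-record SGD for linear regression: predict, compute the error,
    // nudge the weights, and emit the current weight vector downstream.
    val weights: DataStream[Vector[Double]] = samples
      .map(new MapFunction[Sample, Vector[Double]] {
        private val learningRate = 0.01
        private var w: Vector[Double] = Vector(0.0, 0.0)

        override def map(s: Sample): Vector[Double] = {
          val prediction = (w, s.features).zipped.map(_ * _).sum
          val error = prediction - s.label
          w = (w, s.features).zipped.map((wi, xi) => wi - learningRate * error * xi)
          w
        }
      })

    weights.print()
    env.execute("online-sgd-sketch")
  }
}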

@Katherin, my understanding is that given the limited resources, there is
no development effort focused on batch processing right now.

So to summarize, it seems like there are people willing to work on ML on
Flink, but nobody is sure how to do it.
There are many directions we could take (batch, online, batch on
streaming), each with its own merits and downsides.

If you want we can start a design doc and move the conversation there, come
up with a roadmap and start implementing.

Regards,
Theodore

[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html

On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <ma...@gaborhermann.com>
wrote:

> It's great to see so much activity in this discussion :)
> I'll try to add my thoughts.
>
> I think building a developer community (Till's 2. point) can be slightly
> separated from what features we should aim for (1. point) and showcasing
> (3. point). Thanks Till for bringing up the ideas for restructuring, I'm
> sure we'll find a way to make the development process more dynamic. I'll
> try to address the rest here.
>
> It's hard to choose directions between streaming and batch ML. As Theo has
> indicated, not much online ML is used in production, but Flink concentrates
> on streaming, so online ML would be a better fit for Flink. However, as
> most of you argued, there's definite need for batch ML. But batch ML seems
> hard to achieve because there are blocking issues with persisting,
> iteration paths etc. So it's no good either way.
>
> I propose a seemingly crazy solution: what if we developed batch
> algorithms also with the streaming API? The batch API would clearly seem
> more suitable for ML algorithms, but there a lot of benefits of this
> approach too, so it's clearly worth considering. Flink also has the high
> level vision of "streaming for everything" that would clearly fit this
> case. What do you all think about this? Do you think this solution would be
> feasible? I would be happy to make a more elaborate proposal, but I push my
> main ideas here:
>
> 1) Simplifying by using one system
> It could simplify the work of both the users and the developers. One could
> execute training once, or could execute it periodically e.g. by using
> windows. Low-latency serving and training could be done in the same system.
> We could implement incremental algorithms, without any side inputs for
> combining online learning (or predictions) with batch learning. Of course,
> all the logic describing these must be somehow implemented (e.g.
> synchronizing predictions with training), but it should be easier to do so
> in one system, than by combining e.g. the batch and streaming API.
>
> 2) Batch ML with the streaming API is not harder
> Despite these benefits, it could seem harder to implement batch ML with
> the streaming API, but in my opinion it's not. There are more flexible,
> lower-level optimization potentials with the streaming API. Most
> distributed ML algorithms use a lower-level model than the batch API
> anyway, so sometimes it feels like forcing the algorithm logic into the
> training API and tweaking it. Although we could not use the batch
> primitives like join, we would have the E.g. in my experience with
> implementing a distributed matrix factorization algorithm [1], I couldn't
> do a simple optimization because of the limitations of the iteration API
> [2]. Even if we pushed all the development effort to make the batch API
> more suitable for ML there would be things we couldn't do. E.g. there are
> approaches for updating a model iteratively without locks [3,4] (i.e.
> somewhat asynchronously), and I don't see a clear way to implement such
> algorithms with the batch API.
>
> 3) Streaming community (users and devs) benefit
> The Flink streaming community in general would also benefit from this
> direction. There are many features needed in the streaming API for ML to
> work, but this is also true for the batch API. One really important is the
> loops API (a.k.a. iterative DataStreams) [5]. There has been a lot of
> effort (mostly from Paris) for making it mature enough [6]. Kate mentioned
> using GPUs, and I'm sure they have uses in streaming generally [7]. Thus,
> by improving the streaming API to allow ML algorithms, the streaming API
> benefit too (which is important as they have a lot more production users
> than the batch API).
>
> 4) Performance can be at least as good
> I believe the same performance could be achieved with the streaming API as
> with the batch API. Streaming API is much closer to the runtime than the
> batch API. For corner-cases, with runtime-layer optimizations of batch API,
> we could find a way to do the same (or similar) optimization for the
> streaming API (see my previous point). Such case could be using managed
> memory (and spilling to disk). There are also benefits by default, e.g. we
> would have a finer grained fault tolerance with the streaming API.
>
> 5) We could keep batch ML API
> For the shorter term, we should not throw away all the algorithms
> implemented with the batch API. By pushing forward the development with
> side inputs we could make them usable with streaming API. Then, if the
> library gains some popularity, we could replace the algorithms in the batch
> API with streaming ones, to avoid the performance costs of e.g. not being
> able to persist.
>
> 6) General tools for implementing ML algorithms
> Besides implementing algorithms one by one, we could give more general
> tools for making it easier to implement algorithms. E.g. parameter server
> [8,9]. Theo also mentioned in another thread that TensorFlow has a similar
> model to Flink streaming, we could look into that too. I think often when
> deploying a production ML system, much more configuration and tweaking
> should be done than e.g. Spark MLlib allows. Why not allow that?
>
> 7) Showcasing
> Showcasing this could be easier. We could say that we're doing batch ML
> with a streaming API. That's interesting in its own. IMHO this integration
> is also a more approachable way towards end-to-end ML.
>
>
> Thanks for reading so far :)
>
> [1] https://github.com/apache/flink/pull/2819
> [2] https://issues.apache.org/jira/browse/FLINK-2396
> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
> [4] https://www.usenix.org/system/files/conference/hotos13/hotos
> 13-final77.pdf
> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
> Scoped+Loops+and+Job+Termination
> [6] https://github.com/apache/flink/pull/1668
> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> com/Using-QueryableState-inside-Flink-jobs-and-
> Parameter-Server-implementation-td15880.html
>
> Cheers,
> Gabor
>

Re: [DISCUSS] Flink ML roadmap

Posted by Gábor Hermann <ma...@gaborhermann.com>.
It's great to see so much activity in this discussion :)
I'll try to add my thoughts.

I think building a developer community (Till's second point) can be slightly
separated from what features we should aim for (first point) and showcasing
(third point). Thanks Till for bringing up the ideas for restructuring, I'm
sure we'll find a way to make the development process more dynamic. I'll
try to address the rest here.

It's hard to choose a direction between streaming and batch ML. As Theo
has indicated, not much online ML is used in production, but Flink
concentrates on streaming, so online ML would be a better fit for Flink.
However, as most of you argued, there's a definite need for batch ML. But
batch ML seems hard to achieve because there are blocking issues with
persisting, iteration paths, etc. So neither direction is an easy choice.

I propose a seemingly crazy solution: what if we developed batch
algorithms with the streaming API as well? The batch API would clearly
seem more suitable for ML algorithms, but there are a lot of benefits to
this approach too, so it's clearly worth considering. Flink also has the
high-level vision of "streaming for everything" that would clearly fit
this case. What do you all think about this? Do you think this solution
would be feasible? I would be happy to make a more elaborate proposal,
but I'll sketch my main ideas here:

1) Simplifying by using one system
It could simplify the work of both the users and the developers. One 
could execute training once, or could execute it periodically e.g. by 
using windows. Low-latency serving and training could be done in the 
same system. We could implement incremental algorithms, without any side 
inputs for combining online learning (or predictions) with batch 
learning. Of course, all the logic describing these must be somehow 
implemented (e.g. synchronizing predictions with training), but it 
should be easier to do so in one system, than by combining e.g. the 
batch and streaming API.
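
To make this point a bit more concrete, below is a minimal, purely
illustrative sketch of the "periodic retraining" idea with the current
DataStream Scala API. The LabeledPoint/Model types, the socket source and
the fit() stub are all made up for the example; a real job would plug in
an actual trainer and feed the emitted models into a serving operator.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Hypothetical types, only for illustration.
case class LabeledPoint(label: Double, features: Vector[Double])
case class Model(weights: Vector[Double])

object PeriodicRetrainingSketch {

  // Stand-in for a real trainer, e.g. a few passes of SGD over the window.
  def fit(points: Iterable[LabeledPoint]): Model =
    Model(Vector.fill(2)(0.0))

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Assume an unbounded source of "label,f1,f2" records (made up here).
    val training: DataStream[LabeledPoint] = env
      .socketTextStream("localhost", 9999)
      .map { line =>
        val cols = line.split(",").map(_.toDouble)
        LabeledPoint(cols.head, cols.tail.toVector)
      }

    // Retrain on the data of every 5-minute window and emit the refreshed
    // model; the model stream could then be broadcast to a scoring operator.
    val models: DataStream[Model] = training
      .timeWindowAll(Time.minutes(5))
      .apply { (window: TimeWindow, points: Iterable[LabeledPoint],
                out: Collector[Model]) =>
        out.collect(fit(points))
      }

    models.print()
    env.execute("periodic-retraining-sketch")
  }
}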

2) Batch ML with the streaming API is not harder
Despite these benefits, it could seem harder to implement batch ML with
the streaming API, but in my opinion it's not. There are more flexible,
lower-level optimization opportunities with the streaming API. Most
distributed ML algorithms use a lower-level model than the batch API
anyway, so sometimes it feels like forcing the algorithm logic into the
training API and tweaking it. Although we could not use the batch
primitives like join, we would gain lower-level control instead. E.g. in
my experience with implementing a distributed matrix factorization
algorithm [1], I couldn't do a simple optimization because of the
limitations of the iteration API [2]. Even if we pushed all the
development effort into making the batch API more suitable for ML, there
would be things we couldn't do. E.g. there are approaches for updating a
model iteratively without locks [3,4] (i.e. somewhat asynchronously), and
I don't see a clear way to implement such algorithms with the batch API.

3) Streaming community (users and devs) benefit
The Flink streaming community in general would also benefit from this
direction. There are many features needed in the streaming API for ML to
work, but this is also true for the batch API. One really important one is
the loops API (a.k.a. iterative DataStreams) [5]. There has been a lot
of effort (mostly from Paris) to make it mature enough [6]. Kate
mentioned using GPUs, and I'm sure they have uses in streaming generally
[7]. Thus, by improving the streaming API to allow ML algorithms, the
streaming API benefits too (which is important, as it has a lot more
production users than the batch API).

4) Performance can be at least as good
I believe the same performance could be achieved with the streaming API
as with the batch API. The streaming API is much closer to the runtime
than the batch API. For corner cases where the batch API has
runtime-layer optimizations, we could find a way to do the same (or a
similar) optimization for the streaming API (see my previous point). One
such case could be using managed memory (and spilling to disk). There are
also benefits by default, e.g. we would have finer-grained fault
tolerance with the streaming API.

5) We could keep batch ML API
For the shorter term, we should not throw away all the algorithms
implemented with the batch API. By pushing forward the development of
side inputs we could make them usable with the streaming API. Then, if
the library gains some popularity, we could replace the batch API
algorithms with streaming ones, to avoid the performance costs of e.g.
not being able to persist.

6) General tools for implementing ML algorithms
Besides implementing algorithms one by one, we could give more general 
tools for making it easier to implement algorithms. E.g. parameter 
server [8,9]. Theo also mentioned in another thread that TensorFlow has 
a similar model to Flink streaming, we could look into that too. I think 
often when deploying a production ML system, much more configuration and 
tweaking should be done than e.g. Spark MLlib allows. Why not allow that?
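
As a rough illustration of what such a tool could look like on top of
keyed state, here is a tiny, hypothetical sketch of a single "parameter
shard" as a co-flatmap: workers push gradient deltas on one input and
pull current values on the other. All the message types and the wiring
are made up; a real parameter server would also need batching,
vector-valued parameters, consistency options and so on.

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

// Hypothetical messages: workers push gradient deltas and pull current values.
case class Push(key: Int, delta: Double)
case class Pull(key: Int)
case class Param(key: Int, value: Double)

// One "parameter shard": parameter values live in Flink keyed state, so they
// are partitioned by key and covered by checkpoints.
class ParameterShard extends RichCoFlatMapFunction[Push, Pull, Param] {

  @transient private var param: ValueState[Double] = _

  override def open(parameters: Configuration): Unit = {
    param = getRuntimeContext.getState(
      new ValueStateDescriptor[Double]("param", classOf[Double]))
  }

  // Apply a pushed delta without coordinating with readers
  // (asynchronous, Hogwild-style updates).
  override def flatMap1(push: Push, out: Collector[Param]): Unit =
    param.update(param.value() + push.delta)

  // Answer a pull request with the current value (unset state reads as 0.0).
  override def flatMap2(pull: Pull, out: Collector[Param]): Unit =
    out.collect(Param(pull.key, param.value()))
}

// Wiring sketch:
// pushes.keyBy(_.key).connect(pulls.keyBy(_.key)).flatMap(new ParameterShard)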

7) Showcasing
Showcasing this could be easier. We could say that we're doing batch ML
with a streaming API. That's interesting in its own right. IMHO this
integration is also a more approachable way towards end-to-end ML.


Thanks for reading so far :)

[1] https://github.com/apache/flink/pull/2819
[2] https://issues.apache.org/jira/browse/FLINK-2396
[3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
[4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
[5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
[6] https://github.com/apache/flink/pull/1668
[7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
[8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
[9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html

Cheers,
Gabor

Re: [DISCUSS] Flink ML roadmap

Posted by Katherin Eri <ka...@gmail.com>.
Till, thank you for your response.
But I need to clarify several points:

1) Yes, batch and batch ML is a field full of alternatives, but in my
opinion that doesn't mean we should ignore the problem of not developing
the batch part of Flink. You know, Apache Beam and Apache Mahout both feel
the lack of a properly implemented batch feature set. DL4J will be able to
integrate with Apache Flink, but this integration will work only on paper
and will not be efficient in production.

Did you mean with this phrase: “*Unfortunately, all of these problems are
far from trivial to solve and will require quite some changes to Flink's
runtime. Given Flink's current focus on stream processing, I don't see
enough community capacities left to implement these features soon*.”, that
Apache Flink won’t pay attention to its batch part, or have I got you
wrong?

2) Yes, reimplementing libraries that have already been developed by the
community is not a good way to go, but maybe we should make Flink an engine
that can easily run ML libraries on top of it: integrate with SystemML,
DL4J and many others? But for this we would still require batch calculations.

On Tue, Feb 21, 2017 at 6:01 PM, Stavros Kontopoulos <st.kontopoulos@gmail.com>
wrote:

> Ok I see. Suppose we solve all the critical issues. And suppose we dont go
> with the pure online model (although online ML has a potential)... should
> we move on with the
> current ML implementation which is for batch processing (to the best of my
> knowledge)? The parameter server problem is a long standing one and many
> companies out there started to provide their own solutions. That would be
> very useful but I see it only as part of the solution.
>
> The other thing is that when someone is working locally and does some work
> with Flink he should need to go out of it to play with other libraries.
> Isnt this important for the product success?
>
> Regards,
> Stavros
> On Tue, Feb 21, 2017 at 1:04 PM, Theodore Vasiloudis <
> theodoros.vasiloudis@gmail.com> wrote:
>
> > Thank you all for your thoughts on the matter.
> >
> > Andrea brought up some further engine considerations that we need to
> > address in order to have a competitive ML engine on Flink.
> >
> > I'm happy to see many people willing to contribute to the development of
> ML
> > on Flink. The way I see it, there needs to be buy-in from the rest of the
> > community for such changes to go through.
> >
> > If then you are interested in helping out, tackling one of the issues
> > mentioned in my previous email or the ones mentioned by Andrea are the
> most
> > critical ones, as they require making changes to the core.
> >
> > If you want to take up one of those issues the best way is to start a
> > conversation on the list, and gauge the opinion of the community.
> >
> > Finally, as Stavros mentioned, we need to come up with an updated roadmap
> > for FlinkML that includes these issues.
> >
> > @Andrea, the idea of an online learning library for Flink has been
> broached
> > before, and this semester I have one Master student working on exactly
> > that. From my conversations with people in the industry however, almost
> > nobody uses online learning in production, at best models are updated
> every
> > 5 minutes. So the impact would probably not be very large.
> >
> > I would like to bring up again the topic of model serving that I think
> fits
> > the Flink use-case much better. Developing a system like Clipper [1] on
> top
> > of Flink could be one of the best ways to use Flink for ML.
> >
> > Regards,
> > Theodore
> >
> > [1]  Clipper: A Low-Latency Online Prediction Serving System -
> > https://arxiv.org/abs/1612.03079
> >
> > On Tue, Feb 21, 2017 at 12:10 AM, Andrea Spina <
> andrea.spina@radicalbit.io
> > >
> > wrote:
> >
> > > Hi all,
> > >
> > > Thanks Stavros for pushing forward the discussion which I feel really
> > > relevant.
> > >
> > > Since I'm approaching actively the community just right now and I
> haven't
> > > enough experience and such visibility around the Flink community, I'd
> > limit
> > > myself to share an opinion as a Flink user.
> > >
> > > I'm using Flink since almost a year along two different experiences,
> but
> > > I've bumped into the question "how to handle ML workloads and keep
> Flink
> > as
> > > the main engine?" in both cases. Then the first point raises in my
> mind:
> > > why
> > > do I need to adopt an extra system for purely ML purposes: how amazing
> > > could
> > > be to benefit the Flink engine as ML features provider and to avoid
> > paying
> > > the effort to maintain an additional engine? This thought links also
> > @Timur
> > > opinion: I believe that users would prefer way more a unified
> > architecture
> > > in this case. Even if a user want to use an external tool/library -
> > perhaps
> > > providing additional language support (e.g. R) - so that user should be
> > > capable to run it on top of Flink.
> > >
> > > Along my work with Flink I needed to implement some ML algorithms on
> both
> > > Flink and Spark and I often struggled with Flink performances: namely,
> I
> > > think (in the name of the bigger picture) we should first focus the
> > effort
> > > on solving some well-known Flink limitations as @theodore pinpointed.
> I'd
> > > like to highlight [1] and [2] which I find relevant. Since the
> community
> > > would decide to go ahead with FlinkML I believe fixing the above
> > described
> > > issues may be a good starting point. That would also definitely push
> > > forward
> > > some important integrations as Apache SystemML.
> > >
> > > Given all these points, I'm increasingly convinced that Online Machine
> > > Learning would be the real final objective and the more suitable goal
> > since
> > > we're talking about a real-time streaming engine and - from a real high
> > > point of view - I believe Flink would fit this topic in a more genuine
> > way
> > > than the batch case. We've a connector for Apache SAMOA, but it seems
> in
> > an
> > > early stage of development IMHO and not really active. If we want to
> make
> > > something within Flink instead, we need to speed up the design of some
> > > features (e.g. side inputs [3]).
> > >
> > > I really hope we can define a new roadmap by which we can finally push
> > > forward the topic. I will put my best to help in this way.
> > >
> > > Sincerely,
> > > Andrea
> > >
> > > [1] Add a FlinkTools.persist style method to the Data Set
> > > https://issues.apache.org/jira/browse/FLINK-1730
> > > [2] Only send data to each taskmanager once for broadcasts
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-
> > > 5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
> > > [3] Side inputs - Evolving or static Filter/Enriching
> > >
> https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-
> > > MKQYN3m4/edit#
> > > http://apache-flink-mailing-list-archive.1008284.n3.
> > > nabble.com/DISCUSS-Add-Side-Input-Broadcast-Set-For-
> > > Streaming-API-td11529.html
> > >
> > >
> > >
> > > --
> > > View this message in context: http://apache-flink-mailing-
> > > list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-
> > > roadmap-tp16040p16064.html
> > > Sent from the Apache Flink Mailing List archive. mailing list archive
> at
> > > Nabble.com.
> > >
> >
>
-- 

*Yours faithfully, *

*Kate Eri.*

Re: [DISCUSS] Flink ML roadmap

Posted by Stavros Kontopoulos <st...@gmail.com>.
Ok, I see. Suppose we solve all the critical issues. And suppose we don't go
with the pure online model (although online ML has potential)... should
we move on with the current ML implementation, which is for batch processing
(to the best of my knowledge)? The parameter server problem is a
long-standing one, and many companies out there have started to provide
their own solutions. That would be very useful, but I see it only as part
of the solution.

The other thing is that when someone is working locally and does some work
with Flink, he shouldn't need to go outside of it to play with other libraries.
Isn't this important for the product's success?

Regards,
Stavros
On Tue, Feb 21, 2017 at 1:04 PM, Theodore Vasiloudis <
theodoros.vasiloudis@gmail.com> wrote:

> Thank you all for your thoughts on the matter.
>
> Andrea brought up some further engine considerations that we need to
> address in order to have a competitive ML engine on Flink.
>
> I'm happy to see many people willing to contribute to the development of ML
> on Flink. The way I see it, there needs to be buy-in from the rest of the
> community for such changes to go through.
>
> If then you are interested in helping out, tackling one of the issues
> mentioned in my previous email or the ones mentioned by Andrea are the most
> critical ones, as they require making changes to the core.
>
> If you want to take up one of those issues the best way is to start a
> conversation on the list, and gauge the opinion of the community.
>
> Finally, as Stavros mentioned, we need to come up with an updated roadmap
> for FlinkML that includes these issues.
>
> @Andrea, the idea of an online learning library for Flink has been broached
> before, and this semester I have one Master student working on exactly
> that. From my conversations with people in the industry however, almost
> nobody uses online learning in production, at best models are updated every
> 5 minutes. So the impact would probably not be very large.
>
> I would like to bring up again the topic of model serving that I think fits
> the Flink use-case much better. Developing a system like Clipper [1] on top
> of Flink could be one of the best ways to use Flink for ML.
>
> Regards,
> Theodore
>
> [1]  Clipper: A Low-Latency Online Prediction Serving System -
> https://arxiv.org/abs/1612.03079
>
> On Tue, Feb 21, 2017 at 12:10 AM, Andrea Spina <andrea.spina@radicalbit.io
> >
> wrote:
>
> > Hi all,
> >
> > Thanks Stavros for pushing forward the discussion which I feel really
> > relevant.
> >
> > Since I'm approaching actively the community just right now and I haven't
> > enough experience and such visibility around the Flink community, I'd
> limit
> > myself to share an opinion as a Flink user.
> >
> > I'm using Flink since almost a year along two different experiences, but
> > I've bumped into the question "how to handle ML workloads and keep Flink
> as
> > the main engine?" in both cases. Then the first point raises in my mind:
> > why
> > do I need to adopt an extra system for purely ML purposes: how amazing
> > could
> > be to benefit the Flink engine as ML features provider and to avoid
> paying
> > the effort to maintain an additional engine? This thought links also
> @Timur
> > opinion: I believe that users would prefer way more a unified
> architecture
> > in this case. Even if a user want to use an external tool/library -
> perhaps
> > providing additional language support (e.g. R) - so that user should be
> > capable to run it on top of Flink.
> >
> > Along my work with Flink I needed to implement some ML algorithms on both
> > Flink and Spark and I often struggled with Flink performances: namely, I
> > think (in the name of the bigger picture) we should first focus the
> effort
> > on solving some well-known Flink limitations as @theodore pinpointed. I'd
> > like to highlight [1] and [2] which I find relevant. Since the community
> > would decide to go ahead with FlinkML I believe fixing the above
> described
> > issues may be a good starting point. That would also definitely push
> > forward
> > some important integrations as Apache SystemML.
> >
> > Given all these points, I'm increasingly convinced that Online Machine
> > Learning would be the real final objective and the more suitable goal
> since
> > we're talking about a real-time streaming engine and - from a real high
> > point of view - I believe Flink would fit this topic in a more genuine
> way
> > than the batch case. We've a connector for Apache SAMOA, but it seems in
> an
> > early stage of development IMHO and not really active. If we want to make
> > something within Flink instead, we need to speed up the design of some
> > features (e.g. side inputs [3]).
> >
> > I really hope we can define a new roadmap by which we can finally push
> > forward the topic. I will put my best to help in this way.
> >
> > Sincerely,
> > Andrea
> >
> > [1] Add a FlinkTools.persist style method to the Data Set
> > https://issues.apache.org/jira/browse/FLINK-1730
> > [2] Only send data to each taskmanager once for broadcasts
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-
> > 5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
> > [3] Side inputs - Evolving or static Filter/Enriching
> > https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-
> > MKQYN3m4/edit#
> > http://apache-flink-mailing-list-archive.1008284.n3.
> > nabble.com/DISCUSS-Add-Side-Input-Broadcast-Set-For-
> > Streaming-API-td11529.html
> >
> >
> >
> > --
> > View this message in context: http://apache-flink-mailing-
> > list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-
> > roadmap-tp16040p16064.html
> > Sent from the Apache Flink Mailing List archive. mailing list archive at
> > Nabble.com.
> >
>

Re: [DISCUSS] Flink ML roadmap

Posted by Till Rohrmann <tr...@apache.org>.
Thanks a lot for all your valuable input. It's great to see all your
interest in Flink and its ML library :-)

1) Direction of FlinkML

In order to reboot the FlinkML library we should indeed first decide on its
direction and come up with a roadmap to get the community behind it.

Since we only have limited resources the question for me is first of all
whether we continue developing a batch ML library or whether we concentrate
on streaming machine learning.

The core idea of FlinkML was to provide the user with an easy toolbox to
create machine learning pipelines. These pipelines are per se not batch or
streaming specific but so far all our implementations are based on Flink's
batch API.

While implementing the ML algorithms we realized that Flink's engine still
has some deficiencies on the batch side. Theo already mentioned the
iteration problem with static inputs [1] and the problem of caching
intermediate results [2]. But there are also other problems such as dynamic
memory management [3] and leg-wise scheduling [4] for complex topologies.
Without these features, I don't see that Flink will be able to efficiently
execute batch ML jobs. Unfortunately, all of these problems are far from
trivial to solve and will require quite some changes to Flink's runtime.
Given Flink's current focus on stream processing, I don't see enough
community capacities left to implement these features soon.

Furthermore, if we decide to continue pursuing the batch direction, then
we'll be in direct competition with more established frameworks such as
SparkML, Weka, TensorFlow and scikit-learn, for example. I guess that the
work alone to catch up with these libraries in terms of algorithm support
will be quite challenging.

Therefore, I think it would be more promising to concentrate on streaming
ML and try to establish Flink's brand there. Streaming ML has not been as
thoroughly explored as the batch counterpart and there are not too many
players on the field. Furthermore, it would be well aligned with the
direction of the rest of the project.

1.1) Possible features

I agree with Theo that model serving/low latency prediction would be a
really good/almost natural use case for Flink. For that we would need to be
able to import trained models and do predictions with them. Maybe Clipper
is a good solution for that or maybe PMML or another model format. That is
something we would have to research.
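
Just to illustrate the shape of it, a minimal sketch of low-latency scoring
inside a streaming job could look like the following. TrainedModel and the
load function are hypothetical stand-ins for whatever PMML (or other model
format) evaluator we would end up choosing.

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

// Hypothetical stand-in for whatever model evaluator library we would pick.
trait TrainedModel extends Serializable {
  def score(features: Map[String, Double]): Double
}

// Low-latency prediction: load the exported model once per task in open(),
// then score every incoming event in map().
class ScoringFunction(modelPath: String, load: String => TrainedModel)
    extends RichMapFunction[Map[String, Double], Double] {

  @transient private var model: TrainedModel = _

  override def open(parameters: Configuration): Unit = {
    // e.g. parse a PMML file that a periodic batch training job has written
    model = load(modelPath)
  }

  override def map(features: Map[String, Double]): Double =
    model.score(features)
}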

Next, in order to support continuous model updates (maybe from a
periodically triggered batch job) we would need side input support.
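
Until side inputs are available, one way to approximate continuous model
updates is a co-flatmap that receives broadcast model updates on one input
and scores events on the other. The following is only an illustrative
sketch with made-up types.

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector

// Hypothetical types: an event to score and a full replacement model.
case class Event(features: Vector[Double])
case class ModelUpdate(weights: Vector[Double])

// Keeps the latest model in a field and scores events against it. Until
// proper side inputs exist, the model stream would be broadcast and
// connected with the event stream.
class UpdatingScorer extends CoFlatMapFunction[Event, ModelUpdate, Double] {

  private var weights: Vector[Double] = Vector.empty

  // Score an event with the most recently received model (if any).
  override def flatMap1(event: Event, out: Collector[Double]): Unit =
    if (weights.nonEmpty) {
      out.collect((weights, event.features).zipped.map(_ * _).sum)
    }

  // Replace the local model whenever an update arrives.
  override def flatMap2(update: ModelUpdate, out: Collector[Double]): Unit =
    weights = update.weights
}

// Wiring sketch:
// events.connect(modelUpdates.broadcast).flatMap(new UpdatingScorer)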

With these two features we could probably already realize some really cool
use cases.

2) Growing Flink's ML community

One of the problems with FlinkML, as you've mentioned, was the lack of
active committer support after the initial development. As Gabor pointed
out, if there is no committer around then there is little chance for
contributors to become one, since nothing gets merged, even though we're in
dire need of them.

Since I'm the culprit in this case, I can tell you that it would be
tremendously helpful if the community (in our case, mostly contributors)
kept actively reviewing each other's PRs. If a PR is in good shape then
it's much easier (less work) to merge it. I think this could be an
immediate action point.

Next, I started a discussion thread [5] about restructuring Flink in order
to decrease test and build times but also to allow adding new committers
more easily for modules where we have a high need. Maybe this can help to
solve the committer problem.

3) Showcasing capabilities

I agree with Timur's observation that we have far too little material out
there which showcases what's actually possible to do with Flink wrt ML.
That is something we can start to change right away. One good possibility
is always to write a blog post about an interesting use case you've
implemented. Thus, I very much like Katherin's idea. And indeed, when I
implemented the ALS matrix factorization with Flink, we came across a lot
of problems with Flink.

The other good option which was mentioned is the creation of a kind of ML
cookbook. The cookbook could contain advanced recipes for how to solve
certain problems with FlinkML. The Flink community has always wanted to
create such a cookbook for Flink in general. Maybe we could lay the first
foundation for it.

[1] https://issues.apache.org/jira/browse/FLINK-2396
[2] https://issues.apache.org/jira/browse/FLINK-1404
[3] https://issues.apache.org/jira/browse/FLINK-1101
[4] https://issues.apache.org/jira/browse/FLINK-2119
[5]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Project-build-time-and-possible-restructuring-tt16088.html

Cheers,
Till

On Tue, Feb 21, 2017 at 12:04 PM, Theodore Vasiloudis <
theodoros.vasiloudis@gmail.com> wrote:

> Thank you all for your thoughts on the matter.
>
> Andrea brought up some further engine considerations that we need to
> address in order to have a competitive ML engine on Flink.
>
> I'm happy to see many people willing to contribute to the development of ML
> on Flink. The way I see it, there needs to be buy-in from the rest of the
> community for such changes to go through.
>
> If then you are interested in helping out, tackling one of the issues
> mentioned in my previous email or the ones mentioned by Andrea are the most
> critical ones, as they require making changes to the core.
>
> If you want to take up one of those issues the best way is to start a
> conversation on the list, and gauge the opinion of the community.
>
> Finally, as Stavros mentioned, we need to come up with an updated roadmap
> for FlinkML that includes these issues.
>
> @Andrea, the idea of an online learning library for Flink has been broached
> before, and this semester I have one Master student working on exactly
> that. From my conversations with people in the industry however, almost
> nobody uses online learning in production, at best models are updated every
> 5 minutes. So the impact would probably not be very large.
>
> I would like to bring up again the topic of model serving that I think fits
> the Flink use-case much better. Developing a system like Clipper [1] on top
> of Flink could be one of the best ways to use Flink for ML.
>
> Regards,
> Theodore
>
> [1]  Clipper: A Low-Latency Online Prediction Serving System -
> https://arxiv.org/abs/1612.03079
>
> On Tue, Feb 21, 2017 at 12:10 AM, Andrea Spina <andrea.spina@radicalbit.io
> >
> wrote:
>
> > Hi all,
> >
> > Thanks Stavros for pushing forward the discussion which I feel really
> > relevant.
> >
> > Since I'm approaching actively the community just right now and I haven't
> > enough experience and such visibility around the Flink community, I'd
> limit
> > myself to share an opinion as a Flink user.
> >
> > I'm using Flink since almost a year along two different experiences, but
> > I've bumped into the question "how to handle ML workloads and keep Flink
> as
> > the main engine?" in both cases. Then the first point raises in my mind:
> > why
> > do I need to adopt an extra system for purely ML purposes: how amazing
> > could
> > be to benefit the Flink engine as ML features provider and to avoid
> paying
> > the effort to maintain an additional engine? This thought links also
> @Timur
> > opinion: I believe that users would prefer way more a unified
> architecture
> > in this case. Even if a user want to use an external tool/library -
> perhaps
> > providing additional language support (e.g. R) - so that user should be
> > capable to run it on top of Flink.
> >
> > Along my work with Flink I needed to implement some ML algorithms on both
> > Flink and Spark and I often struggled with Flink performances: namely, I
> > think (in the name of the bigger picture) we should first focus the
> effort
> > on solving some well-known Flink limitations as @theodore pinpointed. I'd
> > like to highlight [1] and [2] which I find relevant. Since the community
> > would decide to go ahead with FlinkML I believe fixing the above
> described
> > issues may be a good starting point. That would also definitely push
> > forward
> > some important integrations as Apache SystemML.
> >
> > Given all these points, I'm increasingly convinced that Online Machine
> > Learning would be the real final objective and the more suitable goal
> since
> > we're talking about a real-time streaming engine and - from a real high
> > point of view - I believe Flink would fit this topic in a more genuine
> way
> > than the batch case. We've a connector for Apache SAMOA, but it seems in
> an
> > early stage of development IMHO and not really active. If we want to make
> > something within Flink instead, we need to speed up the design of some
> > features (e.g. side inputs [3]).
> >
> > I really hope we can define a new roadmap by which we can finally push
> > forward the topic. I will put my best to help in this way.
> >
> > Sincerely,
> > Andrea
> >
> > [1] Add a FlinkTools.persist style method to the Data Set
> > https://issues.apache.org/jira/browse/FLINK-1730
> > [2] Only send data to each taskmanager once for broadcasts
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-
> > 5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
> > [3] Side inputs - Evolving or static Filter/Enriching
> > https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-
> > MKQYN3m4/edit#
> > http://apache-flink-mailing-list-archive.1008284.n3.
> > nabble.com/DISCUSS-Add-Side-Input-Broadcast-Set-For-
> > Streaming-API-td11529.html
> >
> >
> >
> > --
> > View this message in context: http://apache-flink-mailing-
> > list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-
> > roadmap-tp16040p16064.html
> > Sent from the Apache Flink Mailing List archive. mailing list archive at
> > Nabble.com.
> >
>

Re: [DISCUSS] Flink ML roadmap

Posted by Theodore Vasiloudis <th...@gmail.com>.
Thank you all for your thoughts on the matter.

Andrea brought up some further engine considerations that we need to
address in order to have a competitive ML engine on Flink.

I'm happy to see many people willing to contribute to the development of ML
on Flink. The way I see it, there needs to be buy-in from the rest of the
community for such changes to go through.

If you are then interested in helping out, the issues mentioned in my
previous email and the ones mentioned by Andrea are the most critical ones
to tackle, as they require making changes to the core.

If you want to take up one of those issues the best way is to start a
conversation on the list, and gauge the opinion of the community.

Finally, as Stavros mentioned, we need to come up with an updated roadmap
for FlinkML that includes these issues.

@Andrea, the idea of an online learning library for Flink has been broached
before, and this semester I have one Master student working on exactly
that. From my conversations with people in the industry, however, almost
nobody uses online learning in production; at best, models are updated every
5 minutes. So the impact would probably not be very large.

I would like to bring up again the topic of model serving that I think fits
the Flink use-case much better. Developing a system like Clipper [1] on top
of Flink could be one of the best ways to use Flink for ML.

Regards,
Theodore

[1]  Clipper: A Low-Latency Online Prediction Serving System -
https://arxiv.org/abs/1612.03079

On Tue, Feb 21, 2017 at 12:10 AM, Andrea Spina <an...@radicalbit.io>
wrote:

> Hi all,
>
> Thanks Stavros for pushing forward the discussion which I feel really
> relevant.
>
> Since I'm approaching actively the community just right now and I haven't
> enough experience and such visibility around the Flink community, I'd limit
> myself to share an opinion as a Flink user.
>
> I'm using Flink since almost a year along two different experiences, but
> I've bumped into the question "how to handle ML workloads and keep Flink as
> the main engine?" in both cases. Then the first point raises in my mind:
> why
> do I need to adopt an extra system for purely ML purposes: how amazing
> could
> be to benefit the Flink engine as ML features provider and to avoid paying
> the effort to maintain an additional engine? This thought links also @Timur
> opinion: I believe that users would prefer way more a unified architecture
> in this case. Even if a user want to use an external tool/library - perhaps
> providing additional language support (e.g. R) - so that user should be
> capable to run it on top of Flink.
>
> Along my work with Flink I needed to implement some ML algorithms on both
> Flink and Spark and I often struggled with Flink performances: namely, I
> think (in the name of the bigger picture) we should first focus the effort
> on solving some well-known Flink limitations as @theodore pinpointed. I'd
> like to highlight [1] and [2] which I find relevant. Since the community
> would decide to go ahead with FlinkML I believe fixing the above described
> issues may be a good starting point. That would also definitely push
> forward
> some important integrations as Apache SystemML.
>
> Given all these points, I'm increasingly convinced that Online Machine
> Learning would be the real final objective and the more suitable goal since
> we're talking about a real-time streaming engine and - from a real high
> point of view - I believe Flink would fit this topic in a more genuine way
> than the batch case. We've a connector for Apache SAMOA, but it seems in an
> early stage of development IMHO and not really active. If we want to make
> something within Flink instead, we need to speed up the design of some
> features (e.g. side inputs [3]).
>
> I really hope we can define a new roadmap by which we can finally push
> forward the topic. I will put my best to help in this way.
>
> Sincerely,
> Andrea
>
> [1] Add a FlinkTools.persist style method to the Data Set
> https://issues.apache.org/jira/browse/FLINK-1730
> [2] Only send data to each taskmanager once for broadcasts
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-
> 5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
> [3] Side inputs - Evolving or static Filter/Enriching
> https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-
> MKQYN3m4/edit#
> http://apache-flink-mailing-list-archive.1008284.n3.
> nabble.com/DISCUSS-Add-Side-Input-Broadcast-Set-For-
> Streaming-API-td11529.html
>
>
>
> --
> View this message in context: http://apache-flink-mailing-
> list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-
> roadmap-tp16040p16064.html
> Sent from the Apache Flink Mailing List archive. mailing list archive at
> Nabble.com.
>

Re: [DISCUSS] Flink ML roadmap

Posted by Andrea Spina <an...@radicalbit.io>.
Hi all,

Thanks Stavros for pushing forward the discussion which I feel really
relevant.

Since I'm approaching actively the community just right now and I haven't
enough experience and such visibility around the Flink community, I'd limit
myself to share an opinion as a Flink user.

I have been using Flink for almost a year across two different projects, and
in both cases I bumped into the question "how do I handle ML workloads and
keep Flink as the main engine?". The first point that comes to mind is: why
should I adopt an extra system purely for ML? How great would it be to use
the Flink engine as an ML feature provider and avoid the effort of
maintaining an additional engine. This also ties into @Timur's opinion: I
believe users would much prefer a unified architecture in this case. Even if
a user wants to use an external tool/library - perhaps one providing
additional language support (e.g. R) - that user should be able to run it on
top of Flink.

In my work with Flink I have had to implement some ML algorithms on both
Flink and Spark, and I often struggled with Flink's performance. With the
bigger picture in mind, I think we should first focus our effort on solving
some well-known Flink limitations, as @theodore pinpointed. I'd like to
highlight [1] and [2], which I find relevant. If the community decides to go
ahead with FlinkML, I believe fixing the issues described above would be a
good starting point. That would also definitely push forward some important
integrations, such as Apache SystemML.

Given all these points, I'm increasingly convinced that online machine
learning is the real end goal and the more suitable one: we're talking about
a real-time streaming engine, and from a high-level point of view I believe
Flink fits this topic more naturally than the batch case. We have a
connector for Apache SAMOA, but IMHO it is at an early stage of development
and not very active. If we want to build something within Flink instead, we
need to speed up the design of some features (e.g. side inputs [3]).
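
To make the online direction a bit more concrete: until side inputs land, a
common workaround for serving and updating a model on a stream is to connect
the event stream with a broadcast stream of model updates. Below is a rough,
untested Scala sketch; Transaction, Model and Scored are made-up types for
illustration (not an existing API), and the operator state here is not
checkpointed - a real implementation would use managed state.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector

// Hypothetical types, just for the sketch.
case class Transaction(id: Long, amount: Double)
case class Model(threshold: Double) {
  def score(t: Transaction): Double = t.amount / threshold
}
case class Scored(id: Long, score: Double)

def scoreStream(transactions: DataStream[Transaction],
                modelUpdates: DataStream[Model]): DataStream[Scored] =
  transactions
    .connect(modelUpdates.broadcast)  // every parallel task sees all updates
    .flatMap(new CoFlatMapFunction[Transaction, Model, Scored] {
      private var model: Model = _
      override def flatMap1(t: Transaction, out: Collector[Scored]): Unit =
        if (model != null) out.collect(Scored(t.id, model.score(t)))
      override def flatMap2(m: Model, out: Collector[Scored]): Unit =
        model = m                     // keep only the latest model
    })

Side inputs would make this pattern first-class instead of something every
user has to hand-roll.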

I really hope we can define a new roadmap that finally pushes this topic
forward. I will do my best to help with that.

Sincerely, 
Andrea

[1] Add a FlinkTools.persist style method to the Data Set
https://issues.apache.org/jira/browse/FLINK-1730
[2] Only send data to each taskmanager once for broadcasts
https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
[3] Side inputs - Evolving or static Filter/Enriching
https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-MKQYN3m4/edit#
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Add-Side-Input-Broadcast-Set-For-Streaming-API-td11529.html



--
View this message in context: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-tp16040p16064.html
Sent from the Apache Flink Mailing List archive. mailing list archive at Nabble.com.

Re: [DISCUSS] Flink ML roadmap

Posted by Stavros Kontopoulos <st...@gmail.com>.
I think Flink ML could be a success. Many use cases out there could benefit
from such algorithms, especially online ones.
I agree that examples should be created showing how it can be used.

I was not aware of the project re-structuring issues. GPU support is really
important nowadays, but it is still not the major reason people are not
adopting Flink ML. Flink ML has to be developed further and promoted, as
previously stated.

In the meantime, as for the reviewing part, I am investing time there, so I
would like to see if we can join forces and push things forward.

I am aware of the evaluation framework PR and will hopefully review it this
week. But can we commit to pushing anything, given the load people have?

As another option, could we propose someone to be a committer there as
well, someone Till could guide if needed?

I think we don't need to wait for all issues to be solved first. As for the
big picture, re-use makes sense, but I think the end result should be
something that benefits Flink. I would like to stay within Flink as much as
possible from a UX/features point of view. Of course people have been using
a number of libraries for years; what we gain by implementing the algorithms
ourselves is getting them to work on large datasets and on streams, while
keeping the UX familiar.

I think connecting to external libraries should be done, where possible, for
things that are not your domain, like DBs or DFSs etc. Is ML a domain that
belongs to a streaming engine? Use cases drive that IMHO. Again,
implementation should be justified by user needs; if there is no such need,
there is no reason to implement anything.

Just some thoughts...


On Mon, Feb 20, 2017 at 3:39 PM, Timur Shenkao <ts...@timshenkao.su> wrote:

> Hello guys,
>
> My couple of cents.
> All Flink presentations, articles, etc. articulate that Flink is for ETL,
> data ingestion. CEP is a maximum.
> If you visit http://flink.apache.org/usecases.html, you'll there aren't
> any
> explicit ML or Graphs there.
> It's also stated that Flink is suitable when "Data that is processed
> quickly".
> That's why people believe that Flink isn't for ML or don't even know that
> Flink has such algorithms.
> Then, folks decide: "I would better use old good Spark or scikit-learn than
> dive into Flink's internals & implement algo by myself "
>
> Sincerely yours, Timur
>
> On Mon, Feb 20, 2017 at 1:53 PM, Katherin Eri <ka...@gmail.com>
> wrote:
>
> > Hello guys,
> >
> >
> > May be we will be able to focus our forces on some E2E scenario or show
> > case for Flink as also ML supporting engine, and in such a way actualize
> > the roadmap?
> >
> >
> > This means: we can take some real life/production problem, like Fraud
> > detection in some area, and try to solve this problem from the point of
> > view of DataScience.
> >
> > Starting from data preprocessing and preparation, finishing
> > implementation/usage of some ML algorithm.
> >
> > Doing this we will understand which issues are showstopper for
> > implementation of such functionality. We will be able to understand
> Flink’s
> > users better.
> >
> >
> > May be community could share its ideas which show case could be the most
> > useful for Apache Flink, or may be Data artisans could lead this?
> >
> > Mon, 20 Feb 2017 at 15:28, Theodore Vasiloudis <
> > theodoros.vasiloudis@gmail.com>:
> >
> > > Hello all,
> > >
> > > thank you for opening this discussion Stavros, note that it's almost
> > > exactly 1 year since I last opened such a topic (linked by Gabor) and
> the
> > > comments there are still relevant.
> > >
> > > I think Gabor described the current state quite well, development in
> the
> > > libraries is hard without committers dedicated to each project, and as
> a
> > > result FlinkML and CEP have stalled.
> > >
> > > I think it's important to look at why development has stalled as well.
> As
> > > people have mentioned there's a multitude of ML libraries out there and
> > my
> > > impression was that not many people are looking to use Flink for ML.
> > Lately
> > > that seems to have changed (with some interest shown in the Flink
> survey
> > as
> > > well).
> > >
> > > Gabor makes some good points about future directions for the library.
> Our
> > > initial goal [1] was to make a truly scalable, easy to use library,
> > within
> > > the Flink ecosystem, providing a set of "workhorse" algorithms, sampled
> > > from what's actually being used in the industry. We planned for a
> library
> > > that has few algorithms, but does them properly.
> > >
> > > If we decide to go the way of focusing within Flink we face some major
> > > challenges, because these are system limitations that do not
> necessarily
> > > align with the goals of the community. Some issues relevant to ML on
> > Flink
> > > are:
> > >
> > >    - FLINK-2396 - Review the datasets of dynamic path and static path
> in
> > >    iteration.
> > >    https://issues.apache.org/jira/browse/FLINK-2396
> > >    This has to do with the ability to iterate over one datset (model)
> > while
> > >    changing another (dataset), which is necessary for many ML
> algorithms
> > > like
> > >    SGD.
> > >    - FLINK-1730 - Add a FlinkTools.persist style method to the Data
> Set.
> > >    https://issues.apache.org/jira/browse/FLINK-1730
> > >    This is again relevant to many algorithms, to create intermediate
> > >    results etc, for example L-BFGS development has been attempted 2-3
> > > times,
> > >    but always abandoned because of the need to collect a DataSet kills
> > the
> > >    performance.
> > >    - FLINK-5782 - Support GPU calculations
> > >    https://issues.apache.org/jira/browse/FLINK-5782
> > >    Many algorithms will benefit greatly by GPU-accelerated linear
> > algebra,
> > >    to the point where if a library doesn't support it puts it at a
> severe
> > >    disadvantage compared to other offerings.
> > >
> > >
> > > These issues aside, Stephan has mentioned recently the possibility of
> > > re-structuring the Flink project to allow for more flexibility for the
> > > libraries. I think that sounds quite promising and it should allow the
> > > development to pick up in the libraries, if we can get some more people
> > > reviewing and merging PRs.
> > >
> > > I would be all for updating our vision and roadmap to match what the
> > > community desires from the library.
> > >
> > > [1]
> > >
> > > https://cwiki.apache.org/confluence/display/FLINK/
> > FlinkML%3A+Vision+and+Roadmap
> > >
> > > On Mon, Feb 20, 2017 at 12:47 PM, Gábor Hermann <mail@gaborhermann.com
> >
> > > wrote:
> > >
> > > > Hi Stavros,
> > > >
> > > > Thanks for bringing this up.
> > > >
> > > > There have been past [1] and recent [2, 3] discussions about the
> Flink
> > > > libraries, because there are some stalling PRs and overloaded
> > committers.
> > > > (Actually, Till is the only committer shepherd of the both the CEP
> and
> > ML
> > > > library, and AFAIK he has a ton of other responsibilities and work to
> > > do.)
> > > > Thus it's hard to get code reviewed and merged, and without merged
> code
> > > > it's hard to get a committer status, so there are not many committers
> > who
> > > > can review e.g. ML algorithm implementations, and the cycle goes on.
> > > Until
> > > > this is resolved somehow, we should help the committers by reviewing
> > > > each-others PRs.
> > > >
> > > > I think prioritizing features (b) is a good way to start. We could
> > > declare
> > > > most blocking features and concentrate on reviewing and merging them
> > > before
> > > > moving forward. E.g. the evaluation framework is quite important for
> an
> > > ML
> > > > library in my opinion, and has a PR stalling for long [4].
> > > >
> > > > Regarding c),  there are styleguides generally for contributing to
> > Flink,
> > > > so we should follow that. Is there something more ML specific you
> think
> > > we
> > > > could follow? We should definitely declare, we follow scikit-learn
> and
> > > make
> > > > sure contributions comply to that.
> > > >
> > > > In terms of features (a, d), I think we should first see the bigger
> > > > picture. That is, it would be nice to discuss a clearer direction for
> > > Flink
> > > > ML. I've seen a lot of interest in contributing to Flink ML lately. I
> > > > believe we should rethink our goals, to put the contribution efforts
> in
> > > > making a usable and useful library. Are we trying to implement as
> many
> > > > useful algorithms as possible to create a scalable ML library? That
> > would
> > > > seem ambitious, and of course there are a lot of frameworks and
> > libraries
> > > > that already has something like this as goal (e.g. Spark MLlib,
> > Mahout).
> > > > Should we rather create connectors to existing libraries? Then we
> > cannot
> > > > really do Flink specific optimizations. Should we go for online
> machine
> > > > learning (as Flink is concentrating on streaming)? We already have a
> > > > connector to SAMOA. We could go on with questions like this. Maybe
> I'm
> > > > missing something, but I haven't seen such directions declared.
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > > [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > > > com/Opening-a-discussion-on-FlinkML-td10265.html
> > > > [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > > > com/Flink-CEP-development-is-stalling-td15237.html#a15341
> > > > [3] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > > > com/New-Flink-team-member-Kate-Eri-td15349.html
> > > > [4] https://github.com/apache/flink/pull/1849
> > > >
> > > >
> > > > On 2017-02-20 11:43, Stavros Kontopoulos wrote:
> > > >
> > > > (Resending with the appropriate topic)
> > > >>
> > > >> Hi,
> > > >>
> > > >> I would like to start a discussion about next steps for Flink ML.
> > > >> Currently there is a lot of work going on but needs a push forward.
> > > >>
> > > >> Some topics to discuss:
> > > >>
> > > >> a) How several features should be planned and get aligned with Flink
> > > >> releases.
> > > >> b) Priorities of what should be done.
> > > >> c) Basic guidelines for code: styleguides, scikit-learn compliance
> etc
> > > >> d) Missing features important for the success of the library, next
> > steps
> > > >> etc...
> > > >>
> > > >> Thoughts?
> > > >>
> > > >> Best,
> > > >> Stavros
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] Flink ML roadmap

Posted by Timur Shenkao <ts...@timshenkao.su>.
Hello guys,

My two cents.
All Flink presentations, articles, etc. articulate that Flink is for ETL
and data ingestion; CEP at most.
If you visit http://flink.apache.org/usecases.html, you'll see there aren't
any explicit ML or graph use cases there.
It's also stated that Flink is suitable for "Data that is processed
quickly".
That's why people believe that Flink isn't for ML, or don't even know that
Flink has such algorithms.
Then folks decide: "I'd rather use good old Spark or scikit-learn than dive
into Flink's internals and implement the algorithm myself."

Sincerely yours, Timur

On Mon, Feb 20, 2017 at 1:53 PM, Katherin Eri <ka...@gmail.com>
wrote:

> Hello guys,
>
>
> May be we will be able to focus our forces on some E2E scenario or show
> case for Flink as also ML supporting engine, and in such a way actualize
> the roadmap?
>
>
> This means: we can take some real life/production problem, like Fraud
> detection in some area, and try to solve this problem from the point of
> view of DataScience.
>
> Starting from data preprocessing and preparation, finishing
> implementation/usage of some ML algorithm.
>
> Doing this we will understand which issues are showstopper for
> implementation of such functionality. We will be able to understand Flink’s
> users better.
>
>
> May be community could share its ideas which show case could be the most
> useful for Apache Flink, or may be Data artisans could lead this?
>
> Mon, 20 Feb 2017 at 15:28, Theodore Vasiloudis <
> theodoros.vasiloudis@gmail.com>:
>
> > Hello all,
> >
> > thank you for opening this discussion Stavros, note that it's almost
> > exactly 1 year since I last opened such a topic (linked by Gabor) and the
> > comments there are still relevant.
> >
> > I think Gabor described the current state quite well, development in the
> > libraries is hard without committers dedicated to each project, and as a
> > result FlinkML and CEP have stalled.
> >
> > I think it's important to look at why development has stalled as well. As
> > people have mentioned there's a multitude of ML libraries out there and
> my
> > impression was that not many people are looking to use Flink for ML.
> Lately
> > that seems to have changed (with some interest shown in the Flink survey
> as
> > well).
> >
> > Gabor makes some good points about future directions for the library. Our
> > initial goal [1] was to make a truly scalable, easy to use library,
> within
> > the Flink ecosystem, providing a set of "workhorse" algorithms, sampled
> > from what's actually being used in the industry. We planned for a library
> > that has few algorithms, but does them properly.
> >
> > If we decide to go the way of focusing within Flink we face some major
> > challenges, because these are system limitations that do not necessarily
> > align with the goals of the community. Some issues relevant to ML on
> Flink
> > are:
> >
> >    - FLINK-2396 - Review the datasets of dynamic path and static path in
> >    iteration.
> >    https://issues.apache.org/jira/browse/FLINK-2396
> >    This has to do with the ability to iterate over one datset (model)
> while
> >    changing another (dataset), which is necessary for many ML algorithms
> > like
> >    SGD.
> >    - FLINK-1730 - Add a FlinkTools.persist style method to the Data Set.
> >    https://issues.apache.org/jira/browse/FLINK-1730
> >    This is again relevant to many algorithms, to create intermediate
> >    results etc, for example L-BFGS development has been attempted 2-3
> > times,
> >    but always abandoned because of the need to collect a DataSet kills
> the
> >    performance.
> >    - FLINK-5782 - Support GPU calculations
> >    https://issues.apache.org/jira/browse/FLINK-5782
> >    Many algorithms will benefit greatly by GPU-accelerated linear
> algebra,
> >    to the point where if a library doesn't support it puts it at a severe
> >    disadvantage compared to other offerings.
> >
> >
> > These issues aside, Stephan has mentioned recently the possibility of
> > re-structuring the Flink project to allow for more flexibility for the
> > libraries. I think that sounds quite promising and it should allow the
> > development to pick up in the libraries, if we can get some more people
> > reviewing and merging PRs.
> >
> > I would be all for updating our vision and roadmap to match what the
> > community desires from the library.
> >
> > [1]
> >
> > https://cwiki.apache.org/confluence/display/FLINK/
> FlinkML%3A+Vision+and+Roadmap
> >
> > On Mon, Feb 20, 2017 at 12:47 PM, Gábor Hermann <ma...@gaborhermann.com>
> > wrote:
> >
> > > Hi Stavros,
> > >
> > > Thanks for bringing this up.
> > >
> > > There have been past [1] and recent [2, 3] discussions about the Flink
> > > libraries, because there are some stalling PRs and overloaded
> committers.
> > > (Actually, Till is the only committer shepherd of the both the CEP and
> ML
> > > library, and AFAIK he has a ton of other responsibilities and work to
> > do.)
> > > Thus it's hard to get code reviewed and merged, and without merged code
> > > it's hard to get a committer status, so there are not many committers
> who
> > > can review e.g. ML algorithm implementations, and the cycle goes on.
> > Until
> > > this is resolved somehow, we should help the committers by reviewing
> > > each-others PRs.
> > >
> > > I think prioritizing features (b) is a good way to start. We could
> > declare
> > > most blocking features and concentrate on reviewing and merging them
> > before
> > > moving forward. E.g. the evaluation framework is quite important for an
> > ML
> > > library in my opinion, and has a PR stalling for long [4].
> > >
> > > Regarding c),  there are styleguides generally for contributing to
> Flink,
> > > so we should follow that. Is there something more ML specific you think
> > we
> > > could follow? We should definitely declare, we follow scikit-learn and
> > make
> > > sure contributions comply to that.
> > >
> > > In terms of features (a, d), I think we should first see the bigger
> > > picture. That is, it would be nice to discuss a clearer direction for
> > Flink
> > > ML. I've seen a lot of interest in contributing to Flink ML lately. I
> > > believe we should rethink our goals, to put the contribution efforts in
> > > making a usable and useful library. Are we trying to implement as many
> > > useful algorithms as possible to create a scalable ML library? That
> would
> > > seem ambitious, and of course there are a lot of frameworks and
> libraries
> > > that already has something like this as goal (e.g. Spark MLlib,
> Mahout).
> > > Should we rather create connectors to existing libraries? Then we
> cannot
> > > really do Flink specific optimizations. Should we go for online machine
> > > learning (as Flink is concentrating on streaming)? We already have a
> > > connector to SAMOA. We could go on with questions like this. Maybe I'm
> > > missing something, but I haven't seen such directions declared.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > > com/Opening-a-discussion-on-FlinkML-td10265.html
> > > [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > > com/Flink-CEP-development-is-stalling-td15237.html#a15341
> > > [3] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > > com/New-Flink-team-member-Kate-Eri-td15349.html
> > > [4] https://github.com/apache/flink/pull/1849
> > >
> > >
> > > On 2017-02-20 11:43, Stavros Kontopoulos wrote:
> > >
> > > (Resending with the appropriate topic)
> > >>
> > >> Hi,
> > >>
> > >> I would like to start a discussion about next steps for Flink ML.
> > >> Currently there is a lot of work going on but needs a push forward.
> > >>
> > >> Some topics to discuss:
> > >>
> > >> a) How several features should be planned and get aligned with Flink
> > >> releases.
> > >> b) Priorities of what should be done.
> > >> c) Basic guidelines for code: styleguides, scikit-learn compliance etc
> > >> d) Missing features important for the success of the library, next
> steps
> > >> etc...
> > >>
> > >> Thoughts?
> > >>
> > >> Best,
> > >> Stavros
> > >>
> > >>
> > >
> >
>

Re: [DISCUSS] Flink ML roadmap

Posted by Katherin Eri <ka...@gmail.com>.
Hello guys,


Maybe we could focus our efforts on some E2E scenario or showcase of Flink
as an ML-supporting engine, and in that way make the roadmap concrete?


This means we can take some real-life/production problem, like fraud
detection in some area, and try to solve it from a data science point of
view.

We would start from data preprocessing and preparation and finish with the
implementation/usage of some ML algorithm.

Doing this, we will understand which issues are showstoppers for
implementing such functionality, and we will be able to understand Flink's
users better.


Maybe the community could share its ideas on which showcase would be the
most useful for Apache Flink, or maybe data Artisans could lead this?

Mon, 20 Feb 2017 at 15:28, Theodore Vasiloudis <
theodoros.vasiloudis@gmail.com>:

> Hello all,
>
> thank you for opening this discussion Stavros, note that it's almost
> exactly 1 year since I last opened such a topic (linked by Gabor) and the
> comments there are still relevant.
>
> I think Gabor described the current state quite well, development in the
> libraries is hard without committers dedicated to each project, and as a
> result FlinkML and CEP have stalled.
>
> I think it's important to look at why development has stalled as well. As
> people have mentioned there's a multitude of ML libraries out there and my
> impression was that not many people are looking to use Flink for ML. Lately
> that seems to have changed (with some interest shown in the Flink survey as
> well).
>
> Gabor makes some good points about future directions for the library. Our
> initial goal [1] was to make a truly scalable, easy to use library, within
> the Flink ecosystem, providing a set of "workhorse" algorithms, sampled
> from what's actually being used in the industry. We planned for a library
> that has few algorithms, but does them properly.
>
> If we decide to go the way of focusing within Flink we face some major
> challenges, because these are system limitations that do not necessarily
> align with the goals of the community. Some issues relevant to ML on Flink
> are:
>
>    - FLINK-2396 - Review the datasets of dynamic path and static path in
>    iteration.
>    https://issues.apache.org/jira/browse/FLINK-2396
>    This has to do with the ability to iterate over one datset (model) while
>    changing another (dataset), which is necessary for many ML algorithms
> like
>    SGD.
>    - FLINK-1730 - Add a FlinkTools.persist style method to the Data Set.
>    https://issues.apache.org/jira/browse/FLINK-1730
>    This is again relevant to many algorithms, to create intermediate
>    results etc, for example L-BFGS development has been attempted 2-3
> times,
>    but always abandoned because of the need to collect a DataSet kills the
>    performance.
>    - FLINK-5782 - Support GPU calculations
>    https://issues.apache.org/jira/browse/FLINK-5782
>    Many algorithms will benefit greatly by GPU-accelerated linear algebra,
>    to the point where if a library doesn't support it puts it at a severe
>    disadvantage compared to other offerings.
>
>
> These issues aside, Stephan has mentioned recently the possibility of
> re-structuring the Flink project to allow for more flexibility for the
> libraries. I think that sounds quite promising and it should allow the
> development to pick up in the libraries, if we can get some more people
> reviewing and merging PRs.
>
> I would be all for updating our vision and roadmap to match what the
> community desires from the library.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap
>
> On Mon, Feb 20, 2017 at 12:47 PM, Gábor Hermann <ma...@gaborhermann.com>
> wrote:
>
> > Hi Stavros,
> >
> > Thanks for bringing this up.
> >
> > There have been past [1] and recent [2, 3] discussions about the Flink
> > libraries, because there are some stalling PRs and overloaded committers.
> > (Actually, Till is the only committer shepherd of the both the CEP and ML
> > library, and AFAIK he has a ton of other responsibilities and work to
> do.)
> > Thus it's hard to get code reviewed and merged, and without merged code
> > it's hard to get a committer status, so there are not many committers who
> > can review e.g. ML algorithm implementations, and the cycle goes on.
> Until
> > this is resolved somehow, we should help the committers by reviewing
> > each-others PRs.
> >
> > I think prioritizing features (b) is a good way to start. We could
> declare
> > most blocking features and concentrate on reviewing and merging them
> before
> > moving forward. E.g. the evaluation framework is quite important for an
> ML
> > library in my opinion, and has a PR stalling for long [4].
> >
> > Regarding c),  there are styleguides generally for contributing to Flink,
> > so we should follow that. Is there something more ML specific you think
> we
> > could follow? We should definitely declare, we follow scikit-learn and
> make
> > sure contributions comply to that.
> >
> > In terms of features (a, d), I think we should first see the bigger
> > picture. That is, it would be nice to discuss a clearer direction for
> Flink
> > ML. I've seen a lot of interest in contributing to Flink ML lately. I
> > believe we should rethink our goals, to put the contribution efforts in
> > making a usable and useful library. Are we trying to implement as many
> > useful algorithms as possible to create a scalable ML library? That would
> > seem ambitious, and of course there are a lot of frameworks and libraries
> > that already has something like this as goal (e.g. Spark MLlib, Mahout).
> > Should we rather create connectors to existing libraries? Then we cannot
> > really do Flink specific optimizations. Should we go for online machine
> > learning (as Flink is concentrating on streaming)? We already have a
> > connector to SAMOA. We could go on with questions like this. Maybe I'm
> > missing something, but I haven't seen such directions declared.
> >
> > Cheers,
> > Gabor
> >
> > [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > com/Opening-a-discussion-on-FlinkML-td10265.html
> > [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > com/Flink-CEP-development-is-stalling-td15237.html#a15341
> > [3] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> > com/New-Flink-team-member-Kate-Eri-td15349.html
> > [4] https://github.com/apache/flink/pull/1849
> >
> >
> > On 2017-02-20 11:43, Stavros Kontopoulos wrote:
> >
> > (Resending with the appropriate topic)
> >>
> >> Hi,
> >>
> >> I would like to start a discussion about next steps for Flink ML.
> >> Currently there is a lot of work going on but needs a push forward.
> >>
> >> Some topics to discuss:
> >>
> >> a) How several features should be planned and get aligned with Flink
> >> releases.
> >> b) Priorities of what should be done.
> >> c) Basic guidelines for code: styleguides, scikit-learn compliance etc
> >> d) Missing features important for the success of the library, next steps
> >> etc...
> >>
> >> Thoughts?
> >>
> >> Best,
> >> Stavros
> >>
> >>
> >
>

Re: [DISCUSS] Flink ML roadmap

Posted by Theodore Vasiloudis <th...@gmail.com>.
Hello all,

thank you for opening this discussion, Stavros. Note that it's almost
exactly 1 year since I last opened such a topic (linked by Gabor), and the
comments there are still relevant.

I think Gabor described the current state quite well: development in the
libraries is hard without committers dedicated to each project, and as a
result FlinkML and CEP have stalled.

I think it's important to look at why development has stalled as well. As
people have mentioned, there's a multitude of ML libraries out there, and my
impression was that not many people are looking to use Flink for ML. Lately
that seems to have changed (with some interest shown in the Flink survey as
well).

Gabor makes some good points about future directions for the library. Our
initial goal [1] was to make a truly scalable, easy-to-use library within
the Flink ecosystem, providing a set of "workhorse" algorithms sampled from
what's actually being used in industry. We planned for a library that has
few algorithms, but does them properly.

If we decide to go the way of focusing within Flink, we face some major
challenges, because these are system limitations whose resolution does not
necessarily align with the goals of the rest of the community. Some issues
relevant to ML on Flink are:

   - FLINK-2396 - Review the datasets of dynamic path and static path in
   iteration.
   https://issues.apache.org/jira/browse/FLINK-2396
   This has to do with the ability to iterate over one dataset (the model)
   while changing another (the data), which is necessary for many ML
   algorithms like SGD.
   - FLINK-1730 - Add a FlinkTools.persist style method to the Data Set.
   https://issues.apache.org/jira/browse/FLINK-1730
   This is again relevant to many algorithms that need to create
   intermediate results etc.; for example, L-BFGS development has been
   attempted 2-3 times, but was always abandoned because the need to collect
   a DataSet kills performance (see the sketch right after this list).
   - FLINK-5782 - Support GPU calculations
   https://issues.apache.org/jira/browse/FLINK-5782
   Many algorithms would benefit greatly from GPU-accelerated linear
   algebra, to the point where a library that doesn't support it is at a
   severe disadvantage compared to other offerings.
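
To make the FLINK-1730 point concrete: without a first-class persist, the
usual workaround is to write the intermediate DataSet out and read it back,
triggering an extra job. Here is a rough, untested Scala sketch of that
workaround (if I remember correctly, this is roughly what FlinkMLTools.persist
already does inside flink-ml):

import scala.reflect.ClassTag

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.java.io.{TypeSerializerInputFormat, TypeSerializerOutputFormat}
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode
import org.apache.flink.core.fs.Path

// Materialize a DataSet by writing it out and reading it back.
def persist[T: ClassTag: TypeInformation](data: DataSet[T], path: String): DataSet[T] = {
  val env = data.getExecutionEnvironment

  val out = new TypeSerializerOutputFormat[T]
  out.setOutputFilePath(new Path(path))
  out.setWriteMode(WriteMode.OVERWRITE)
  data.output(out)
  env.execute("persist intermediate result")  // extra job just to materialize

  val in = new TypeSerializerInputFormat[T](data.getType)
  in.setFilePath(path)
  env.createInput(in)
}

A native persist would let us keep the intermediate result in the cluster
and avoid the filesystem round trip, which is exactly what algorithms like
L-BFGS need between outer iterations.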


These issues aside, Stephan has recently mentioned the possibility of
re-structuring the Flink project to allow more flexibility for the
libraries. I think that sounds quite promising, and it should allow
development in the libraries to pick up, if we can get some more people
reviewing and merging PRs.

I would be all for updating our vision and roadmap to match what the
community desires from the library.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap

On Mon, Feb 20, 2017 at 12:47 PM, Gábor Hermann <ma...@gaborhermann.com>
wrote:

> Hi Stavros,
>
> Thanks for bringing this up.
>
> There have been past [1] and recent [2, 3] discussions about the Flink
> libraries, because there are some stalling PRs and overloaded committers.
> (Actually, Till is the only committer shepherd of the both the CEP and ML
> library, and AFAIK he has a ton of other responsibilities and work to do.)
> Thus it's hard to get code reviewed and merged, and without merged code
> it's hard to get a committer status, so there are not many committers who
> can review e.g. ML algorithm implementations, and the cycle goes on. Until
> this is resolved somehow, we should help the committers by reviewing
> each-others PRs.
>
> I think prioritizing features (b) is a good way to start. We could declare
> most blocking features and concentrate on reviewing and merging them before
> moving forward. E.g. the evaluation framework is quite important for an ML
> library in my opinion, and has a PR stalling for long [4].
>
> Regarding c),  there are styleguides generally for contributing to Flink,
> so we should follow that. Is there something more ML specific you think we
> could follow? We should definitely declare, we follow scikit-learn and make
> sure contributions comply to that.
>
> In terms of features (a, d), I think we should first see the bigger
> picture. That is, it would be nice to discuss a clearer direction for Flink
> ML. I've seen a lot of interest in contributing to Flink ML lately. I
> believe we should rethink our goals, to put the contribution efforts in
> making a usable and useful library. Are we trying to implement as many
> useful algorithms as possible to create a scalable ML library? That would
> seem ambitious, and of course there are a lot of frameworks and libraries
> that already has something like this as goal (e.g. Spark MLlib, Mahout).
> Should we rather create connectors to existing libraries? Then we cannot
> really do Flink specific optimizations. Should we go for online machine
> learning (as Flink is concentrating on streaming)? We already have a
> connector to SAMOA. We could go on with questions like this. Maybe I'm
> missing something, but I haven't seen such directions declared.
>
> Cheers,
> Gabor
>
> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> com/Opening-a-discussion-on-FlinkML-td10265.html
> [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> com/Flink-CEP-development-is-stalling-td15237.html#a15341
> [3] http://apache-flink-mailing-list-archive.1008284.n3.nabble.
> com/New-Flink-team-member-Kate-Eri-td15349.html
> [4] https://github.com/apache/flink/pull/1849
>
>
> On 2017-02-20 11:43, Stavros Kontopoulos wrote:
>
> (Resending with the appropriate topic)
>>
>> Hi,
>>
>> I would like to start a discussion about next steps for Flink ML.
>> Currently there is a lot of work going on but needs a push forward.
>>
>> Some topics to discuss:
>>
>> a) How several features should be planned and get aligned with Flink
>> releases.
>> b) Priorities of what should be done.
>> c) Basic guidelines for code: styleguides, scikit-learn compliance etc
>> d) Missing features important for the success of the library, next steps
>> etc...
>>
>> Thoughts?
>>
>> Best,
>> Stavros
>>
>>
>

Re: [DISCUSS] Flink ML roadmap

Posted by Gábor Hermann <ma...@gaborhermann.com>.
Hi Stavros,

Thanks for bringing this up.

There have been past [1] and recent [2, 3] discussions about the Flink
libraries, because there are some stalling PRs and overloaded committers.
(Actually, Till is the only committer shepherd of both the CEP and ML
libraries, and AFAIK he has a ton of other responsibilities and work to
do.) Thus it's hard to get code reviewed and merged, and without merged
code it's hard to get committer status, so there are not many committers
who can review e.g. ML algorithm implementations, and the cycle goes on.
Until this is resolved somehow, we should help the committers by reviewing
each other's PRs.

I think prioritizing features (b) is a good way to start. We could declare
the most blocking features and concentrate on reviewing and merging them
before moving forward. E.g. the evaluation framework is quite important for
an ML library in my opinion, and it has a PR that has been stalled for a
long time [4].

Regarding c), there are general style guides for contributing to Flink, so
we should follow those. Is there something more ML-specific you think we
could follow? We should definitely declare that we follow scikit-learn, and
make sure contributions comply with that.
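
Roughly, compliance would mean keeping the Estimator/Transformer/Predictor
style with fit/predict and chainable pipelines. A small sketch from memory
of the current FlinkML pipeline docs (the parameter values are arbitrary,
and I may be off on some names):

import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.preprocessing.{PolynomialFeatures, StandardScaler}
import org.apache.flink.ml.regression.MultipleLinearRegression

val env = ExecutionEnvironment.getExecutionEnvironment

// Toy training data: a label plus a feature vector.
val training = env.fromElements(
  LabeledVector(1.0, DenseVector(0.2, 0.4)),
  LabeledVector(2.0, DenseVector(0.3, 0.7)))

// scikit-learn style: chain transformers and a predictor, then fit/predict.
val pipeline = StandardScaler()
  .chainTransformer(PolynomialFeatures().setDegree(2))
  .chainPredictor(MultipleLinearRegression().setIterations(20))

pipeline.fit(training)

val test = env.fromElements(DenseVector(0.25, 0.5))
val predictions = pipeline.predict(test)

Whatever direction we pick, sticking to this kind of interface keeps the
entry barrier low for people coming from scikit-learn.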

In terms of features (a, d), I think we should first see the bigger
picture. That is, it would be nice to discuss a clearer direction for Flink
ML. I've seen a lot of interest in contributing to Flink ML lately. I
believe we should rethink our goals, to put the contribution effort into
making a usable and useful library. Are we trying to implement as many
useful algorithms as possible to create a scalable ML library? That would
seem ambitious, and of course there are a lot of frameworks and libraries
that already have something like this as a goal (e.g. Spark MLlib, Mahout).
Should we rather create connectors to existing libraries? Then we cannot
really do Flink-specific optimizations. Should we go for online machine
learning (as Flink is concentrating on streaming)? We already have a
connector to SAMOA. We could go on with questions like this. Maybe I'm
missing something, but I haven't seen such directions declared.

Cheers,
Gabor

[1] 
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Opening-a-discussion-on-FlinkML-td10265.html
[2] 
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Flink-CEP-development-is-stalling-td15237.html#a15341
[3] 
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/New-Flink-team-member-Kate-Eri-td15349.html
[4] https://github.com/apache/flink/pull/1849

On 2017-02-20 11:43, Stavros Kontopoulos wrote:

> (Resending with the appropriate topic)
>
> Hi,
>
> I would like to start a discussion about next steps for Flink ML.
> Currently there is a lot of work going on but needs a push forward.
>
> Some topics to discuss:
>
> a) How several features should be planned and get aligned with Flink
> releases.
> b) Priorities of what should be done.
> c) Basic guidelines for code: styleguides, scikit-learn compliance etc
> d) Missing features important for the success of the library, next steps
> etc...
>
> Thoughts?
>
> Best,
> Stavros
>