Posted to dev@flink.apache.org by Katherin Eri <ka...@gmail.com> on 2017/02/06 09:49:14 UTC

Re: New Flink team member - Kate Eri.

Hello, guys.
Theodore, last week I started the review of the PR:
https://github.com/apache/flink/pull/2735 related to *word2Vec for Flink*.

During this review I have asked myself: why do we need to implement such a
very popular algorithm like *word2vec one more time*, when there is already
an available Java implementation provided by the deeplearning4j.org
<https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2 licence)?
This library actively promotes itself, there is some hype around it in the
ML sphere, and it was integrated with Apache Spark to provide scalable
deep learning calculations.
That's why I thought: could we also integrate Flink with this library?
1) Personally I think that providing support and deployment of deep
learning algorithms/models in Flink is a promising and attractive feature,
because:
    a) during the last two years deep learning has proved its efficiency,
and these algorithms are used in many applications. For example, *Spotify*
uses DL-based algorithms for its music recommendations: Recommending music
on Spotify with deep learning, AUGUST 05, 2014
<http://benanne.github.io/2014/08/05/spotify-cnns.html>. Doing this in a
natively scalable way is very attractive.


I have investigated the implementation of the DL4J integration with Apache
Spark, and got several points:

1) It seems that the idea of building our own implementation of word2vec is
not such a bad solution, because the integration of DL4J with Spark is too
strongly coupled with the Spark API, and it will take time on the DL4J side
to adapt this integration to Flink. I had also expected that we would be
able to just call some existing API, but there is no such thing.
2)

https://deeplearning4j.org/use_cases
https://www.analyticsvidhya.com/blog/2017/01/t-sne-implementation-r-python/

Thu, 19 Jan 2017 at 13:29, Till Rohrmann <tr...@apache.org>:

Hi Katherin,

welcome to the Flink community. Always great to see new people joining the
community :-)

Cheers,
Till

On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <ka...@gmail.com>
wrote:

> ok, I've got it.
> I will take a look at  https://github.com/apache/flink/pull/2735.
>
> Tue, 17 Jan 2017 at 14:36, Theodore Vasiloudis <
> theodoros.vasiloudis@gmail.com>:
>
> > Hello Katherin,
> >
> > Welcome to the Flink community!
> >
> > The ML component definitely needs a lot of work you are correct, we are
> > facing similar problems to CEP, which we'll hopefully resolve with the
> > restructuring Stephan has mentioned in that thread.
> >
> > If you'd like to help out with PRs we have many open, one I have started
> > reviewing but got side-tracked is the Word2Vec one [1].
> >
> > Best,
> > Theodore
> >
> > [1] https://github.com/apache/flink/pull/2735
> >
> > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <fh...@gmail.com>
> wrote:
> >
> > > Hi Katherin,
> > >
> > > welcome to the Flink community!
> > > Help with reviewing PRs is always very welcome and a great way to
> > > contribute.
> > >
> > > Best, Fabian
> > >
> > >
> > >
> > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <ka...@gmail.com>:
> > >
> > > > Thank you, Timo.
> > > > I have started the analysis of the topic.
> > > > And if necessary, I will try to review other pull requests)
> > > >
> > > >
> > > > Tue, 17 Jan 2017 at 13:09, Timo Walther <tw...@apache.org>:
> > > >
> > > > > Hi Katherin,
> > > > >
> > > > > great to hear that you would like to contribute! Welcome!
> > > > >
> > > > > I gave you contributor permissions. You can now assign issues to
> > > > > yourself. I assigned FLINK-1750 to you.
> > > > > Right now there are many open ML pull requests, you are very
> welcome
> > to
> > > > > review the code of others, too.
> > > > >
> > > > > Timo
> > > > >
> > > > >
> > > > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> > > > > > Hello, All!
> > > > > >
> > > > > >
> > > > > >
> > > > > > I'm Kate Eri, a Java developer with 6 years of enterprise
> > > > > > experience; I also have some expertise with Scala (half a year).
> > > > > >
> > > > > > For the last 2 years I have participated in several BigData
> > > > > > projects related to Machine Learning (time series analysis,
> > > > > > recommender systems, social networking) and ETL. I have experience
> > > > > > with Hadoop, Apache Spark and Hive.
> > > > > >
> > > > > >
> > > > > > I'm fond of the ML topic, and I see that the Flink project
> > > > > > requires some work in this area, so that's why I would like to
> > > > > > join Flink and ask you to grant the assignment of the ticket
> > > > > > https://issues.apache.org/jira/browse/FLINK-1750 to me.
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: New Flink team member - Kate Eri.

Posted by Felix Neutatz <ne...@googlemail.com>.
Hi Kate,

that's great news. This would help to boost ML on Flink a lot :)

Best regards,
Felix

2017-02-13 14:09 GMT+01:00 Katherin Eri <ka...@gmail.com>:

> Hello guys,
>
>
>
> It seems that issue FLINK-1730
> <https://issues.apache.org/jira/browse/FLINK-1730> significantly impacts
> integration of Flink with SystemML.
>
> They have checked several integrations, and Flink's integration is the
> slowest
> <https://github.com/apache/incubator-systemml/pull/119#issuecomment-222059794>:
>
>    - MR: LinregDS: 147s (2 jobs); LinregCG w/ 6 iterations: 361s (8 jobs)
>    w/ mmchain; 628s (14 jobs) w/o mmchain
>    - Spark: LinregDS: 71s (3 jobs); LinregCG w/ 6 iterations: 41s (8 jobs)
>    w/ mmchain; 48s (14 jobs) w/o mmchain
>    - Flink: LinregDS: 212s (3 jobs); LinregCG w/ 6 iterations: 1,047s (14
>    jobs) w/o mmchain
>
> As Felix already said, this is caused by two reasons:
>
> 1)      FLINK-1730 <https://issues.apache.org/jira/browse/FLINK-1730>
>
> 2)      FLINK-4175 <https://issues.apache.org/jira/browse/FLINK-4175>
>
> Since FLINK-1730 is not assigned to anyone, we would like to take this
> ticket on (my colleagues could try to implement it).
>
> I would like to continue the discussion related to FLINK-1730 in the
> corresponding ticket.
>
>
> Fri, 10 Feb 2017 at 19:57, Katherin Eri <ka...@gmail.com>:
>
> > I have created a ticket to discuss GPU-related questions further:
> > https://issues.apache.org/jira/browse/FLINK-5782
> >
> > Fri, 10 Feb 2017 at 18:16, Katherin Eri <ka...@gmail.com>:
> >
> > Thank you, Trevor!
> >
> > You have shared very valuable points; I will consider them.
> >
> > So I think I should finally create a ticket in Flink's JIRA, at least
> > for Flink's GPU support, and move the related discussion there?
> >
> > I will contact Suneel regarding DL4J, thanks!
> >
> >
> > Fri, 10 Feb 2017 at 17:44, Trevor Grant <tr...@gmail.com>:
> >
> > Also RE: DL4J integration.
> >
> > Suneel had done some work on this a while back, and ran into issues.  You
> > might want to chat with him about the pitfalls and 'gotchyas' there.
> >
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Fri, Feb 10, 2017 at 7:37 AM, Trevor Grant <tr...@gmail.com>
> > wrote:
> >
> > > Sorry for chiming in late.
> > >
> > > GPUs on Flink.  Till raised a good point- you need to be able to fall
> > back
> > > to non-GPU resources if they aren't available.
> > >
> > > Fun fact: this has already been developed for Flink vis-a-vis the
> Apache
> > > Mahout project.
> > >
> > > In short- Mahout exposes a number of tensor functions (vector %*%
> matrix,
> > > matrix %*% matrix, etc).  If compiled for GPU support, those operations
> > are
> > > completed via GPU- and if no GPUs are in fact available, Mahout math
> > falls
> > > back to CPUs (and finally back to the JVM).
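The fallback chain described above (GPU, then native CPU, then JVM) boils down to a simple selection pattern. A hedged sketch of that pattern, with illustrative names rather than Mahout's actual API:

```scala
// Conceptual sketch of the solver fallback Trevor describes.
// The solver names are illustrative, not Mahout's real classes.
sealed trait Solver
case object CudaSolver   extends Solver // GPU path (e.g. ViennaCL/CUDA)
case object OpenMpSolver extends Solver // native CPU path (e.g. OpenMP)
case object JvmSolver    extends Solver // pure-JVM fallback

object SolverFactory {
  // The availability flags are assumptions: real code would probe for
  // GPUs / native libraries at runtime and degrade gracefully.
  def pick(gpuAvailable: Boolean, nativeAvailable: Boolean): Solver =
    if (gpuAvailable) CudaSolver
    else if (nativeAvailable) OpenMpSolver
    else JvmSolver
}
```

The point of the pattern is that algorithm code only depends on the `Solver` contract, so the same job runs unchanged whether or not GPUs are present.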
> > >
> > > How this should work is Flink takes care of shipping data around the
> > > cluster, and when data arrives at the local node- is dumped out to GPU
> > for
> > > calculation, loaded back up and shipped back around cluster.  In
> > practice,
> > > the lack of a persist method for intermediate results makes this
> > > troublesome (not because of GPUs but for calculating any sort of
> complex
> > > algorithm we expect to be able to cache intermediate results).
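The cost of the missing persist operator can be made concrete with a toy counter: without caching, the expensive input pipeline re-runs on every iteration. A conceptual sketch (not Flink or Mahout code):

```scala
// Toy illustration of why iterative algorithms need a cache/persist step.
object CachingSketch {
  // Returns how many times the expensive pipeline actually ran.
  def run(iterations: Int, cache: Boolean): Int = {
    var loads = 0
    // Stand-in for an expensive source + preprocessing pipeline.
    def loadAndPrepare(): Seq[Double] = { loads += 1; Seq(1.0, 2.0, 3.0) }

    val cached = if (cache) Some(loadAndPrepare()) else None
    var model = 0.0
    for (_ <- 1 to iterations) {
      val data = cached.getOrElse(loadAndPrepare())
      model += data.sum // stand-in for one solver iteration
    }
    loads
  }
}
```

With caching the pipeline runs once; without it, once per iteration, which matches the pattern in the LinregCG numbers quoted earlier in the thread.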
> > >
> > > +1 to FLINK-1730
> > >
> > > Everything in Mahout is modular- distributed engine
> > > (Flink/Spark/Write-your-own), Native Solvers (OpenMP / ViennaCL / CUDA
> /
> > > Write-your-own), algorithms, etc.
> > >
> > > So to sum up, you're noting the redundancy between ML packages in terms
> > of
> > > algorithms- I would recommend checking out Mahout before rolling your
> own
> > > GPU integration (else risk redundantly integrating GPUs). If nothing
> > else-
> > > it should give you some valuable insight regarding design
> considerations.
> > > Also FYI the goal of the Apache Mahout project is to address that
> problem
> > > precisely- implement an algorithm once in a mathematically expressive
> > DSL,
> > > which is abstracted above the engine so the same code easily ports
> > between
> > > engines / native solvers (i.e. CPU/GPU).
> > >
> > > https://github.com/apache/mahout/tree/master/viennacl-omp
> > > https://github.com/apache/mahout/tree/master/viennacl
> > >
> > > Best,
> > > tg
> > >
> > >
> > > Trevor Grant
> > > Data Scientist
> > > https://github.com/rawkintrevo
> > > http://stackexchange.com/users/3002022/rawkintrevo
> > > http://trevorgrant.org
> > >
> > > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> > >
> > >
> > > On Fri, Feb 10, 2017 at 7:01 AM, Katherin Eri <ka...@gmail.com>
> > > wrote:
> > >
> > >> Thank you, Felix, for the provided information.
> > >>
> > >> Currently I am analyzing the provided integration of Flink with
> > >> SystemML.
> > >>
> > >> I am also gathering information for the ticket FLINK-1730
> > >> <https://issues.apache.org/jira/browse/FLINK-1730>; maybe we will take
> > >> it on, to unlock the SystemML/Flink integration.
> > >>
> > >>
> > >>
> > >> Thu, 9 Feb 2017 at 0:17, Felix Neutatz <neutatz@googlemail.com.invalid>:
> > >>
> > >> > Hi Kate,
> > >> >
> > >> > 1) - Broadcast:
> > >> >
> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+
> > >> Only+send+data+to+each+taskmanager+once+for+broadcasts
> > >> >  - Caching: https://issues.apache.org/jira/browse/FLINK-1730
> > >> >
> > >> > 2) I have no idea about the GPU implementation. The SystemML mailing
> > >> > list will probably help you out there.
> > >> >
> > >> > Best regards,
> > >> > Felix
> > >> >
> > >> > 2017-02-08 14:33 GMT+01:00 Katherin Eri <ka...@gmail.com>:
> > >> >
> > >> > > Thank you, Felix, for your point; it is quite interesting.
> > >> > >
> > >> > > I will take a look at the code of the provided Flink integration.
> > >> > >
> > >> > > 1)    You mentioned these problems with Flink: >>we realized that
> > >> > > the lack of a caching operator and a broadcast issue highly affects
> > >> > > the performance. Have you already asked the community about this?
> > >> > > If yes, please provide a reference to the ticket or the subject of
> > >> > > the mail thread.
> > >> > >
> > >> > > 2)    You have said that SystemML provides GPU support. I have seen
> > >> > > SystemML's source code and would like to ask: why have you decided
> > >> > > to implement your own integration with CUDA? Did you consider ND4J,
> > >> > > or do you maintain your own implementation because ND4J is younger?
> > >> > >
> > >> > > Tue, 7 Feb 2017 at 18:35, Felix Neutatz <
> neutatz@googlemail.com
> > >:
> > >> > >
> > >> > > > Hi Katherin,
> > >> > > >
> > >> > > > we are also working in a similar direction. We implemented a
> > >> prototype
> > >> > to
> > >> > > > integrate with SystemML:
> > >> > > > https://github.com/apache/incubator-systemml/pull/119
> > >> > > > SystemML provides many different matrix formats, operations, GPU
> > >> > support
> > >> > > > and a couple of DL algorithms. Unfortunately, we realized that
> the
> > >> lack
> > >> > > of
> > >> > > > a caching operator and a broadcast issue highly affects the
> > >> performance
> > >> > > > (e.g. compared to Spark). At the moment I am trying to tackle
> the
> > >> > > broadcast
> > >> > > > issue. But caching is still a problem for us.
> > >> > > >
> > >> > > > Best regards,
> > >> > > > Felix
> > >> > > >
> > >> > > > 2017-02-07 16:22 GMT+01:00 Katherin Eri <katherinmail@gmail.com
> >:
> > >> > > >
> > >> > > > > Thank you, Till.
> > >> > > > >
> > >> > > > > 1)      Regarding ND4J, I didn't know about such an unfortunate
> > >> > > > > and critical restriction of it -> the lack of sparsity
> > >> > > > > optimizations, and you are right: this issue is still relevant
> > >> > > > > for them. I saw that Flink uses Breeze, but I thought its usage
> > >> > > > > was due to historical reasons.
> > >> > > > >
> > >> > > > > 2)      Regarding integration with DL4J, I have read the source
> > >> > > > > code of the DL4J/Spark integration; that's why I have dropped my
> > >> > > > > idea of reusing their word2vec implementation for now. I can
> > >> > > > > perform a deeper investigation of this topic if required.
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > So I feel that we have the following picture:
> > >> > > > >
> > >> > > > > 1)      DL integration investigation: it could be part of
> > >> > > > > Apache Bahir. I can investigate this topic further, but I think
> > >> > > > > we need a separate ticket to track this activity.
> > >> > > > >
> > >> > > > > 2)      GPU support, required for DL, is interesting, but it
> > >> > > > > requires ND4J, for example.
> > >> > > > >
> > >> > > > > 3)      ND4J couldn't be incorporated because it doesn't support
> > >> > > > > sparsity <https://deeplearning4j.org/roadmap.html> [1].
> > >> > > > >
> > >> > > > > Regarding ND4J: is this the single blocker for incorporating it,
> > >> > > > > or are there other known ones?
> > >> > > > >
> > >> > > > >
> > >> > > > > [1] https://deeplearning4j.org/roadmap.html
> > >> > > > >
> > >> > > > > Tue, 7 Feb 2017 at 16:26, Till Rohrmann <
> > trohrmann@apache.org
> > >> >:
> > >> > > > >
> > >> > > > > Thanks for initiating this discussion Katherin. I think you're
> > >> right
> > >> > > that
> > >> > > > > in general it does not make sense to reinvent the wheel over
> and
> > >> over
> > >> > > > > again. Especially if you only have limited resources at hand.
> So
> > >> if
> > >> > we
> > >> > > > > could integrate Flink with some existing library that would be
> > >> great.
> > >> > > > >
> > >> > > > > In the past, however, we couldn't find a good library which
> > >> provided
> > >> > > > enough
> > >> > > > > freedom to integrate it with Flink. Especially if you want to
> > have
> > >> > > > > distributed and somewhat high-performance implementations of
> ML
> > >> > > > algorithms
> > >> > > > > you would have to take Flink's execution model (capabilities
> as
> > >> well
> > >> > as
> > >> > > > > limitations) into account. That is mainly the reason why we
> > >> started
> > >> > > > > implementing some of the algorithms "natively" on Flink.
> > >> > > > >
> > >> > > > > If I remember correctly, then the problem with ND4J was and
> > still
> > >> is
> > >> > > that
> > >> > > > > it does not support sparse matrices which was a requirement
> from
> > >> our
> > >> > > > side.
> > >> > > > > As far as I know, it is quite common that you have sparse data
> > >> > > structures
> > >> > > > > when dealing with large scale problems. That's why we built
> our
> > >> own
> > >> > > > > abstraction which can have different implementations.
> Currently,
> > >> the
> > >> > > > > default implementation uses Breeze.
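The kind of abstraction described here can be sketched as a small trait with dense and sparse implementations sharing one contract; this is an assumption-laden illustration, not FlinkML's actual API:

```scala
// Minimal sketch of a vector abstraction with dense and sparse backends.
// Not FlinkML's real types; it only illustrates why the abstraction helps.
trait Vec { def dot(other: Array[Double]): Double }

final class DenseVec(values: Array[Double]) extends Vec {
  def dot(other: Array[Double]): Double =
    values.zip(other).map { case (a, b) => a * b }.sum
}

// Stores only the non-zero entries as (index -> value) pairs.
final class SparseVec(size: Int, entries: Map[Int, Double]) extends Vec {
  require(entries.keySet.forall(k => k >= 0 && k < size))
  def dot(other: Array[Double]): Double =
    entries.map { case (i, v) => v * other(i) }.sum
}
```

For a vector with a million entries and a handful of non-zeros, the sparse form stores only those few pairs yet answers the same dot-product queries, which is exactly the capability ND4J was missing.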
> > >> > > > >
> > >> > > > > I think the support for GPU based operations and the actual
> > >> resource
> > >> > > > > management are two orthogonal things. The implementation would
> > >> have
> > >> > to
> > >> > > > work
> > >> > > > > with no GPUs available anyway. If the system detects that GPUs
> > are
> > >> > > > > available, then ideally it would exploit them. Thus, we could
> > add
> > >> > this
> > >> > > > > feature later and maybe integrate it with FLINK-5131 [1].
> > >> > > > >
> > >> > > > > Concerning the integration with DL4J I think that Theo's
> > proposal
> > >> to
> > >> > do
> > >> > > > it
> > >> > > > > in a separate repository (maybe as part of Apache Bahir) is a
> > good
> > >> > > idea.
> > >> > > > > We're currently thinking about outsourcing some of Flink's
> > >> libraries
> > >> > > into
> > >> > > > > sub projects. This could also be an option for the DL4J
> > >> integration
> > >> > > then.
> > >> > > > > In general I think it should be feasible to run DL4J on Flink
> > >> given
> > >> > > that
> > >> > > > it
> > >> > > > > also runs on Spark. Have you already looked at it closer?
> > >> > > > >
> > >> > > > > [1] https://issues.apache.org/jira/browse/FLINK-5131
> > >> > > > >
> > >> > > > > Cheers,
> > >> > > > > Till
> > >> > > > >
> > >> > > > > On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri <
> > >> > katherinmail@gmail.com>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Thank you Theodore, for your reply.
> > >> > > > > >
> > >> > > > > > 1)    Regarding GPU, your point is clear and I agree with it;
> > >> > > > > > ND4J looks appropriate. But my current understanding is that
> > >> > > > > > we also need to cover some resource management questions ->
> > >> > > > > > when we provide GPU support, we also need to manage GPUs as a
> > >> > > > > > resource. For example, Mesos already supports GPUs as a
> > >> > > > > > resource type: Initial support for GPU resources
> > >> > > > > > <https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU>.
> > >> > > > > > Flink uses Mesos as a cluster manager, which means this Mesos
> > >> > > > > > feature could be reused. Also, memory management questions in
> > >> > > > > > Flink regarding GPU should be clarified.
> > >> > > > > >
> > >> > > > > > 2)    Regarding integration with DL4J: what stops us from
> > >> > > > > > creating a ticket and starting the discussion around this
> > >> > > > > > topic? Do we need some user story, or is the community not
> > >> > > > > > sure that DL is really helpful? Why did the discussion with
> > >> > > > > > Adam Gibson end with no implementation of any idea? What
> > >> > > > > > concerns do we have?
> > >> > > > > >
> > >> > > > > > Mon, 6 Feb 2017 at 15:01, Theodore Vasiloudis <
> > >> > > > > > theodoros.vasiloudis@gmail.com>:
> > >> > > > > >
> > >> > > > > > > Hello all,
> > >> > > > > > >
> > >> > > > > > > This is a point that has come up in the past: Given the
> > >> multitude
> > >> > of
> > >> > > ML
> > >> > > > > > > libraries out there, should we have native implementations
> > in
> > >> > > FlinkML
> > >> > > > > or
> > >> > > > > > > try to integrate other libraries instead?
> > >> > > > > > >
> > >> > > > > > > We haven't managed to reach a consensus on this before. My
> > >> > opinion
> > >> > > is
> > >> > > > > > that
> > >> > > > > > > there is definitely value in having ML algorithms written
> > >> > natively
> > >> > > in
> > >> > > > > > > Flink, both for performance optimization,
> > >> > > > > > > but more importantly for engineering simplicity, we don't
> > >> want to
> > >> > > > force
> > >> > > > > > > users to use yet another piece of software to run their ML
> > >> algos
> > >> > > (at
> > >> > > > > > least
> > >> > > > > > > for a basic set of algorithms).
> > >> > > > > > >
> > >> > > > > > > We have in the past  discussed integrations with DL4J
> > >> > (particularly
> > >> > > > > ND4J)
> > >> > > > > > > with Adam Gibson, the core developer of the library, but
> we
> > >> never
> > >> > > got
> > >> > > > > > > around to implementing anything.
> > >> > > > > > >
> > >> > > > > > > Whether it makes sense to have an integration with DL4J as
> > >> > > > > > > part of the Flink distribution would be up for discussion. I
> > >> > > > > > > would suggest making it an independent repo, to allow for
> > >> > > > > > > faster dev/release cycles, and because it wouldn't be
> > >> > > > > > > directly related to the core of Flink it would otherwise add
> > >> > > > > > > extra reviewing burden to an already overloaded group of
> > >> > > > > > > committers.
> > >> > > > > > >
> > >> > > > > > > Natively supporting GPU calculations in Flink would be
> much
> > >> > better
> > >> > > > > > achieved
> > >> > > > > > > through a library like ND4J, the engineering burden would
> be
> > >> too
> > >> > > much
> > >> > > > > > > otherwise.
> > >> > > > > > >
> > >> > > > > > > Regards,
> > >> > > > > > > Theodore
> > >> > > > > > >
> > >> > > > > > > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <
> > >> > > > katherinmail@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hello, guys.
> > >> > > > > > > >
> > >> > > > > > > > Theodore, last week I started the review of the PR:
> > >> > > > > > > > https://github.com/apache/flink/pull/2735 related to
> > >> *word2Vec
> > >> > > for
> > >> > > > > > > Flink*.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > During this review I have asked myself: why do we need to
> > >> > > > > > > > implement such a very popular algorithm like *word2vec one
> > >> > > > > > > > more time*, when there is already an available Java
> > >> > > > > > > > implementation provided by the deeplearning4j.org
> > >> > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
> > >> > > > > > > > Apache 2 licence)?
> > >> > > > > > > > This library actively promotes itself, there is some hype
> > >> > > > > > > > around it in the ML sphere, and it was integrated with
> > >> > > > > > > > Apache Spark to provide scalable deep learning
> > >> > > > > > > > calculations.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > *That's why I thought: could we also integrate Flink with
> > >> > > > > > > > this library?*
> > >> > > > > > > >
> > >> > > > > > > > 1) Personally I think that providing support and
> > >> > > > > > > > deployment of *Deeplearning (DL) algorithms/models in
> > >> > > > > > > > Flink* is a promising and attractive feature, because:
> > >> > > > > > > >
> > >> > > > > > > >     a) during the last two years DL has proved its
> > >> > > > > > > > efficiency, and these algorithms are used in many
> > >> > > > > > > > applications. For example, *Spotify* uses DL-based
> > >> > > > > > > > algorithms for music content extraction: Recommending
> > >> > > > > > > > music on Spotify with deep learning, AUGUST 05, 2014
> > >> > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html>,
> > >> > > > > > > > for their music recommendations. Developers need to scale
> > >> > > > > > > > up DL manually, which causes a lot of work; that's why
> > >> > > > > > > > platforms like Flink should support the deployment of
> > >> > > > > > > > these models.
> > >> > > > > > > >
> > >> > > > > > > >     b) Here is the scope of deep learning use cases
> > >> > > > > > > > <https://deeplearning4j.org/use_cases>; many of these
> > >> > > > > > > > scenarios could be supported on Flink.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > 2) But DL raises questions such as:
> > >> > > > > > > >
> > >> > > > > > > >     a) scaling up calculations over machines
> > >> > > > > > > >
> > >> > > > > > > >     b) performing these calculations over both CPU and
> > >> > > > > > > > GPU. GPU is required to train big DL models; otherwise the
> > >> > > > > > > > learning process could have very slow convergence.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > 3) I have checked the DL4J library, which already has
> > >> > > > > > > > rich support for many attractive DL models like: Recurrent
> > >> > > > > > > > Networks and LSTMs, Convolutional Networks (CNN),
> > >> > > > > > > > Restricted Boltzmann Machines (RBM) and others. So we
> > >> > > > > > > > won't need to implement them independently, but only
> > >> > > > > > > > provide the ability to execute these models over a Flink
> > >> > > > > > > > cluster, quite similar to the way it was integrated with
> > >> > > > > > > > Apache Spark.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > Because of all of this I propose:
> > >> > > > > > > >
> > >> > > > > > > > 1)    To create a new ticket in Flink's JIRA for the
> > >> > > > > > > > integration of Flink with DL4J and decide on which side
> > >> > > > > > > > this integration should be implemented.
> > >> > > > > > > >
> > >> > > > > > > > 2)    To natively support GPU resources in Flink and allow
> > >> > > > > > > > calculations over them, as described in this publication:
> > >> > > > > > > > https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > *Regarding the original issue Implement Word2Vec
> > >> > > > > > > > <https://issues.apache.org/jira/browse/FLINK-2094> in
> > >> > > > > > > > Flink,* I have investigated its implementation in DL4J and
> > >> > > > > > > > the implementation of the DL4J integration with Apache
> > >> > > > > > > > Spark, and got several points:
> > >> > > > > > > >
> > >> > > > > > > > It seems that the idea of building our own implementation
> > >> > > > > > > > of word2vec in Flink is not such a bad solution, because
> > >> > > > > > > > DL4J was forced to reimplement its original word2vec over
> > >> > > > > > > > Spark. I have checked the integration of DL4J with Spark
> > >> > > > > > > > and found that it is too strongly coupled with the Spark
> > >> > > > > > > > API, so it is impossible to just take some DL4J API and
> > >> > > > > > > > reuse it; instead we need to implement an independent
> > >> > > > > > > > integration for Flink.
> > >> > > > > > > >
> > >> > > > > > > > *That's why we can simply finish the implementation of the
> > >> > > > > > > > current PR independently from the DL4J integration.*
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > Could you please provide your opinion regarding my
> > >> > > > > > > > questions and points? What do you think about them?
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > Mon, 6 Feb 2017 at 12:51, Katherin Eri <
> > >> > > katherinmail@gmail.com
> > >> > > > >:
> > >> > > > > > > >
> > >> > > > > > > > > Sorry, guys I need to finish this letter first.
> > >> > > > > > > > >   Full version of it will come shortly.
> > >> > > > > > > > >
> > >> > > > > > fhueske@gmail.com
> > >> > > > > > > >
> > >> > > > > > > > > > wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > > Hi Katherin,
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > welcome to the Flink community!
> > >> > > > > > > > > > > > Help with reviewing PRs is always very welcome
> > and a
> > >> > > great
> > >> > > > > way
> > >> > > > > > to
> > >> > > > > > > > > > > > contribute.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Best, Fabian
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> > >> > > > > > > > katherinmail@gmail.com
> > >> > > > > > > > > >:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > > Thank you, Timo.
> > >> > > > > > > > > > > > > I have started the analysis of the topic.
> > >> > > > > > > > > > > > > And if it necessary, I will try to perform the
> > >> review
> > >> > > of
> > >> > > > > > other
> > >> > > > > > > > > pulls)
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > вт, 17 янв. 2017 г. в 13:09, Timo Walther <
> > >> > > > > > twalthr@apache.org
> > >> > > > > > > >:
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Hi Katherin,
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > great to hear that you would like to
> > contribute!
> > >> > > > Welcome!
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > I gave you contributor permissions. You can
> > now
> > >> > > assign
> > >> > > > > > issues
> > >> > > > > > > > to
> > >> > > > > > > > > > > > > > yourself. I assigned FLINK-1750 to you.
> > >> > > > > > > > > > > > > > Right now there are many open ML pull
> > requests,
> > >> you
> > >> > > are
> > >> > > > > > very
> > >> > > > > > > > > > welcome
> > >> > > > > > > > > > > to
> > >> > > > > > > > > > > > > > review the code of others, too.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Timo
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Am 17/01/17 um 10:39 schrieb Katherin
> Sotenko:
> > >> > > > > > > > > > > > > > > Hello, All!
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > I'm Kate Eri, I'm java developer with
> 6-year
> > >> > > > enterprise
> > >> > > > > > > > > > experience,
> > >> > > > > > > > > > > > > also
> > >> > > > > > > > > > > > > > I
> > >> > > > > > > > > > > > > > > have some expertise with scala (half of
> the
> > >> > year).
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Last 2 years I have participated in
> several
> > >> > BigData
> > >> > > > > > > projects
> > >> > > > > > > > > that
> > >> > > > > > > > > > > > were
> > >> > > > > > > > > > > > > > > related to Machine Learning (Time series
> > >> > analysis,
> > >> > > > > > > > Recommender
> > >> > > > > > > > > > > > systems,
> > >> > > > > > > > > > > > > > > Social networking) and ETL. I have
> > experience
> > >> > with
> > >> > > > > > Hadoop,
> > >> > > > > > > > > Apache
> > >> > > > > > > > > > > > Spark
> > >> > > > > > > > > > > > > > and
> > >> > > > > > > > > > > > > > > Hive.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > I’m fond of ML topic, and I see that Flink
> > >> > project
> > >> > > > > > requires
> > >> > > > > > > > > some
> > >> > > > > > > > > > > work
> > >> > > > > > > > > > > > > in
> > >> > > > > > > > > > > > > > > this area, so that’s why I would like to
> > join
> > >> > Flink
> > >> > > > and
> > >> > > > > > ask
> > >> > > > > > > > me
> > >> > > > > > > > > to
> > >> > > > > > > > > > > > grant
> > >> > > > > > > > > > > > > > the
> > >> > > > > > > > > > > > > > > assignment of the ticket
> > >> > > > > > > > > > > > > > https://issues.apache.org/jira
> > >> /browse/FLINK-1750
> > >> > > > > > > > > > > > > > > to me.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
> >
>

Re: New Flink team member - Kate Eri.

Posted by Katherin Eri <ka...@gmail.com>.
Hello guys,



It seems that issue FLINK-1730
<https://issues.apache.org/jira/browse/FLINK-1730> significantly impacts the
integration of Flink with SystemML.

They have benchmarked several integrations, and the Flink integration is the
slowest
<https://github.com/apache/incubator-systemml/pull/119#issuecomment-222059794>:

   - MR: LinregDS: 147s (2 jobs); LinregCG w/ 6 iterations: 361s (8 jobs)
   w/ mmchain; 628s (14 jobs) w/o mmchain
   - Spark: LinregDS: 71s (3 jobs); LinregCG w/ 6 iterations: 41s (8 jobs)
   w/ mmchain; 48s (14 jobs) w/o mmchain
   - Flink: LinregDS: 212s (3 jobs); LinregCG w/ 6 iterations: 1,047s (14
   jobs) w/o mmchain
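To put these timings in perspective, the slowdown factors can be computed directly from the numbers above (a small illustrative Python sketch; nothing here is SystemML or Flink code, only the quoted benchmark values):

```python
# Relative slowdown of the Flink integration vs. Spark, computed from the
# SystemML benchmark timings quoted above (all values in seconds).
timings = {
    "LinregDS":           {"Spark": 71, "Flink": 212},
    "LinregCG (6 iter.)": {"Spark": 41, "Flink": 1047},
}

for job, t in timings.items():
    print(f"{job}: Flink/Spark = {t['Flink'] / t['Spark']:.1f}x")
# LinregDS: Flink/Spark = 3.0x
# LinregCG (6 iter.): Flink/Spark = 25.5x
```

The iterative LinregCG job suffers far more (about 25x) than the single-pass LinregDS job (about 3x), which is consistent with the missing caching operator being the main bottleneck.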

As Felix already said, this is caused by two issues:

1)      FLINK-1730 <https://issues.apache.org/jira/browse/FLINK-1730>

2)      FLINK-4175 <https://issues.apache.org/jira/browse/FLINK-4175>

Since FLINK-1730 is not assigned to anyone, we would like to take this
ticket (my colleagues could try to implement it).

I would like to continue the discussion related to FLINK-1730 in the
corresponding ticket.


пт, 10 февр. 2017 г. в 19:57, Katherin Eri <ka...@gmail.com>:

> I have created the ticket to discuss GPU-related questions further:
> https://issues.apache.org/jira/browse/FLINK-5782
>
> пт, 10 февр. 2017 г. в 18:16, Katherin Eri <ka...@gmail.com>:
>
> Thank you, Trevor!
>
> You have shared very valuable points; I will consider them.
>
> So I think I should finally create a ticket in Flink’s JIRA, at least for
> Flink's GPU support and move the related discussion there?
>
> I will contact Suneel regarding DL4J, thanks!
>
>
> пт, 10 февр. 2017 г. в 17:44, Trevor Grant <tr...@gmail.com>:
>
> Also RE: DL4J integration.
>
> Suneel had done some work on this a while back, and ran into issues.  You
> might want to chat with him about the pitfalls and 'gotchyas' there.
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Fri, Feb 10, 2017 at 7:37 AM, Trevor Grant <tr...@gmail.com>
> wrote:
>
> > Sorry for chiming in late.
> >
> > GPUs on Flink.  Till raised a good point- you need to be able to fall
> back
> > to non-GPU resources if they aren't available.
> >
> > Fun fact: this has already been developed for Flink vis-a-vis the Apache
> > Mahout project.
> >
> > In short- Mahout exposes a number of tensor functions (vector %*% matrix,
> > matrix %*% matrix, etc).  If compiled for GPU support, those operations
> are
> > completed via GPU- and if no GPUs are in fact available, Mahout math
> falls
> > back to CPUs (and finally back to the JVM).
> >
> > How this should work is Flink takes care of shipping data around the
> > cluster, and when data arrives at the local node- is dumped out to GPU
> for
> > calculation, loaded back up and shipped back around cluster.  In
> practice,
> > the lack of a persist method for intermediate results makes this
> > troublesome (not because of GPUs but for calculating any sort of complex
> > algorithm we expect to be able to cache intermediate results).
> >
> > +1 to FLINK-1730
> >
> > Everything in Mahout is modular- distributed engine
> > (Flink/Spark/Write-your-own), Native Solvers (OpenMP / ViennaCL / CUDA /
> > Write-your-own), algorithms, etc.
> >
> > So to sum up, you're noting the redundancy between ML packages in terms
> of
> > algorithms- I would recommend checking out Mahout before rolling your own
> > GPU integration (else risk redundantly integrating GPUs). If nothing
> else-
> > it should give you some valuable insight regarding design considerations.
> > Also FYI the goal of the Apache Mahout project is to address that problem
> > precisely- implement an algorithm once in a mathematically expressive
> DSL,
> > which is abstracted above the engine so the same code easily ports
> between
> > engines / native solvers (i.e. CPU/GPU).
> >
> > https://github.com/apache/mahout/tree/master/viennacl-omp
> > https://github.com/apache/mahout/tree/master/viennacl
> >
> > Best,
> > tg
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Fri, Feb 10, 2017 at 7:01 AM, Katherin Eri <ka...@gmail.com>
> > wrote:
> >
> >> Thank you Felix, for provided information.
> >>
> >> Currently I analyze the provided integration of Flink with SystemML.
> >>
> >> And also gather the information for the ticket  FLINK-1730
> >> <https://issues.apache.org/jira/browse/FLINK-1730>, maybe we will take
> it
> >> to work, to unlock SystemML/Flink integration.
> >>
> >>
> >>
> >> чт, 9 февр. 2017 г. в 0:17, Felix Neutatz <neutatz@googlemail.com.invali
> >> d>:
> >>
> >> > Hi Kate,
> >> >
> >> > 1) - Broadcast:
> >> >
> >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+
> >> Only+send+data+to+each+taskmanager+once+for+broadcasts
> >> >  - Caching: https://issues.apache.org/jira/browse/FLINK-1730
> >> >
> >> > 2) I have no idea about the GPU implementation. The SystemML mailing
> >> list
> >> > will probably help you out there.
> >> >
> >> > Best regards,
> >> > Felix
> >> >
> >> > 2017-02-08 14:33 GMT+01:00 Katherin Eri <ka...@gmail.com>:
> >> >
> >> > > Thank you Felix, for your point, it is quite interesting.
> >> > >
> >> > > I will take a look at the code, of the provided Flink integration.
> >> > >
> >> > > 1)    You have these problems with Flink: >>we realized that the
> lack
> >> of
> >> > a
> >> > > caching operator and a broadcast issue highly affects the
> performance,
> >> > have
> >> > > you already asked the community about this? If yes, please
> >> provide
> >> > the
> >> > > reference to the ticket or the subject of the mail.
> >> > >
> >> > > 2)    You have said that SystemML provides GPU support. I have seen
> >> > > SystemML’s source code and would like to ask: why have you decided
> to
> >> > > implement your own integration with CUDA? Did you consider
> >> ND4J,
> >> > or
> >> > > because it is younger, you support your own implementation?
> >> > >
> >> > > вт, 7 февр. 2017 г. в 18:35, Felix Neutatz <neutatz@googlemail.com
> >:
> >> > >
> >> > > > Hi Katherin,
> >> > > >
> >> > > > we are also working in a similar direction. We implemented a
> >> prototype
> >> > to
> >> > > > integrate with SystemML:
> >> > > > https://github.com/apache/incubator-systemml/pull/119
> >> > > > SystemML provides many different matrix formats, operations, GPU
> >> > support
> >> > > > and a couple of DL algorithms. Unfortunately, we realized that the
> >> lack
> >> > > of
> >> > > > a caching operator and a broadcast issue highly affects the
> >> performance
> >> > > > (e.g. compared to Spark). At the moment I am trying to tackle the
> >> > > broadcast
> >> > > > issue. But caching is still a problem for us.
> >> > > >
> >> > > > Best regards,
> >> > > > Felix
> >> > > >
> >> > > > 2017-02-07 16:22 GMT+01:00 Katherin Eri <ka...@gmail.com>:
> >> > > >
> >> > > > > Thank you, Till.
> >> > > > >
> >> > > > > 1)      Regarding ND4J, I didn’t know about such an unfortunate and
> >> critical
> >> > > > > restriction of it -> lack of sparsity optimizations, and you are
> >> > right:
> >> > > > > this issue is still actual for them. I saw that Flink uses
> Breeze,
> >> > but
> >> > > I
> >> > > > > thought its usage was caused by some historical reasons.
> >> > > > >
> >> > > > > 2)      Regarding integration with DL4J, I have read the source
> >> code
> >> > of
> >> > > > > DL4J/Spark integration, that’s why I have declined my idea of
> >> reuse
> >> > of
> >> > > > > their word2vec implementation for now, for example. I can
> perform
> >> > > deeper
> >> > > > > investigation of this topic, if it is required.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > So I feel that we have the following picture:
> >> > > > >
> >> > > > > 1)      DL integration investigation could be part of Apache
> >> Bahir.
> >> > I
> >> > > > can
> >> > > > > perform further investigation of this topic, but I think we need
> >> a
> >> > > > > separate ticket to track this activity.
> >> > > > >
> >> > > > > 2)      GPU support, required for DL is interesting, but
> requires
> >> > ND4J
> >> > > > for
> >> > > > > example.
> >> > > > >
> >> > > > > 3)      ND4J couldn’t be incorporated because it doesn’t support
> >> > > sparsity
> >> > > > > <https://deeplearning4j.org/roadmap.html> [1].
> >> > > > >
> >> > > > > Regarding ND4J is this the single blocker for incorporation of
> it
> >> or
> >> > > may
> >> > > > be
> >> > > > > some others known?
> >> > > > >
> >> > > > >
> >> > > > > [1] https://deeplearning4j.org/roadmap.html
> >> > > > >
> >> > > > > вт, 7 февр. 2017 г. в 16:26, Till Rohrmann <
> trohrmann@apache.org
> >> >:
> >> > > > >
> >> > > > > Thanks for initiating this discussion Katherin. I think you're
> >> right
> >> > > that
> >> > > > > in general it does not make sense to reinvent the wheel over and
> >> over
> >> > > > > again. Especially if you only have limited resources at hand. So
> >> if
> >> > we
> >> > > > > could integrate Flink with some existing library that would be
> >> great.
> >> > > > >
> >> > > > > In the past, however, we couldn't find a good library which
> >> provided
> >> > > > enough
> >> > > > > freedom to integrate it with Flink. Especially if you want to
> have
> >> > > > > distributed and somewhat high-performance implementations of ML
> >> > > > algorithms
> >> > > > > you would have to take Flink's execution model (capabilities as
> >> well
> >> > as
> >> > > > > limitations) into account. That is mainly the reason why we
> >> started
> >> > > > > implementing some of the algorithms "natively" on Flink.
> >> > > > >
> >> > > > > If I remember correctly, then the problem with ND4J was and
> still
> >> is
> >> > > that
> >> > > > > it does not support sparse matrices which was a requirement from
> >> our
> >> > > > side.
> >> > > > > As far as I know, it is quite common that you have sparse data
> >> > > structures
> >> > > > > when dealing with large scale problems. That's why we built our
> >> own
> >> > > > > abstraction which can have different implementations. Currently,
> >> the
> >> > > > > default implementation uses Breeze.
> >> > > > >
> >> > > > > I think the support for GPU based operations and the actual
> >> resource
> >> > > > > management are two orthogonal things. The implementation would
> >> have
> >> > to
> >> > > > work
> >> > > > > with no GPUs available anyway. If the system detects that GPUs
> are
> >> > > > > available, then ideally it would exploit them. Thus, we could
> add
> >> > this
> >> > > > > feature later and maybe integrate it with FLINK-5131 [1].
> >> > > > >
> >> > > > > Concerning the integration with DL4J I think that Theo's
> proposal
> >> to
> >> > do
> >> > > > it
> >> > > > > in a separate repository (maybe as part of Apache Bahir) is a
> good
> >> > > idea.
> >> > > > > We're currently thinking about outsourcing some of Flink's
> >> libraries
> >> > > into
> >> > > > > sub projects. This could also be an option for the DL4J
> >> integration
> >> > > then.
> >> > > > > In general I think it should be feasible to run DL4J on Flink
> >> given
> >> > > that
> >> > > > it
> >> > > > > also runs on Spark. Have you already looked at it closer?
> >> > > > >
> >> > > > > [1] https://issues.apache.org/jira/browse/FLINK-5131
> >> > > > >
> >> > > > > Cheers,
> >> > > > > Till
> >> > > > >
> >> > > > > On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri <
> >> > katherinmail@gmail.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Thank you Theodore, for your reply.
> >> > > > > >
> >> > > > > > 1)    Regarding GPU, your point is clear and I agree with it,
> >> ND4J
> >> > > > looks
> >> > > > > > appropriate. But, my current understanding is that, we also
> >> need to
> >> > > > cover
> >> > > > > > some resource management questions -> when we need to provide
> >> GPU
> >> > > > support
> >> > > > > > we also need to manage it as a resource. For example, Mesos
> has
> >> > > already
> >> > > > > > supported GPU like resource item: Initial support for GPU
> >> > resources.
> >> > > > > > <
> >> > https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU
> >> > > >
> >> > > > > > Flink
> >> > > > > > uses Mesos as cluster manager, and this means that this
> feature
> >> of
> >> > > > Mesos
> >> > > > > > could be reused. Also memory managing questions in Flink
> >> regarding
> >> > > GPU
> >> > > > > > should be clarified.
> >> > > > > >
> >> > > > > > 2)    Regarding integration with DL4J: what stops us to
> >> initialize
> >> > > > ticket
> >> > > > > > and start the discussion around this topic? We need some user
> >> story
> >> > > or
> >> > > > > the
> >> > > > > > community is not sure that DL is really helpful? Why the
> >> discussion
> >> > > > with
> >> > > > > > Adam
> >> > > > > > Gibson just finished with no implementation of any idea? What
> >> > > concerns
> >> > > > do
> >> > > > > > we have?
> >> > > > > >
> >> > > > > > пн, 6 февр. 2017 г. в 15:01, Theodore Vasiloudis <
> >> > > > > > theodoros.vasiloudis@gmail.com>:
> >> > > > > >
> >> > > > > > > Hell all,
> >> > > > > > >
> >> > > > > > > This is point that has come up in the past: Given the
> >> multitude
> >> > of
> >> > > ML
> >> > > > > > > libraries out there, should we have native implementations
> in
> >> > > FlinkML
> >> > > > > or
> >> > > > > > > try to integrate other libraries instead?
> >> > > > > > >
> >> > > > > > > We haven't managed to reach a consensus on this before. My
> >> > opinion
> >> > > is
> >> > > > > > that
> >> > > > > > > there is definitely value in having ML algorithms written
> >> > natively
> >> > > in
> >> > > > > > > Flink, both for performance optimization,
> >> > > > > > > but more importantly for engineering simplicity, we don't
> >> want to
> >> > > > force
> >> > > > > > > users to use yet another piece of software to run their ML
> >> algos
> >> > > (at
> >> > > > > > least
> >> > > > > > > for a basic set of algorithms).
> >> > > > > > >
> >> > > > > > > We have in the past  discussed integrations with DL4J
> >> > (particularly
> >> > > > > ND4J)
> >> > > > > > > with Adam Gibson, the core developer of the library, but we
> >> never
> >> > > got
> >> > > > > > > around to implementing anything.
> >> > > > > > >
> >> > > > > > > Whether it makes sense to have an integration with DL4J as
> >> part
> >> > of
> >> > > > the
> >> > > > > > > Flink distribution would be up for discussion. I would
> >> suggest to
> >> > > > make
> >> > > > > it
> >> > > > > > > an independent repo to allow for
> >> > > > > > > faster dev/release cycles, and because it wouldn't be
> directly
> >> > > > related
> >> > > > > to
> >> > > > > > > the core of Flink so it would add extra reviewing burden to
> an
> >> > > > already
> >> > > > > > > overloaded group of committers.
> >> > > > > > >
> >> > > > > > > Natively supporting GPU calculations in Flink would be much
> >> > better
> >> > > > > > achieved
> >> > > > > > > through a library like ND4J, the engineering burden would be
> >> too
> >> > > much
> >> > > > > > > otherwise.
> >> > > > > > >
> >> > > > > > > Regards,
> >> > > > > > > Theodore
> >> > > > > > >
> >> > > > > > > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <
> >> > > > katherinmail@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hello, guys.
> >> > > > > > > >
> >> > > > > > > > Theodore, last week I started the review of the PR:
> >> > > > > > > > https://github.com/apache/flink/pull/2735 related to
> >> *word2Vec
> >> > > for
> >> > > > > > > Flink*.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > During this review I have asked myself: why do we need to
> >> > > implement
> >> > > > > > such
> >> > > > > > > a
> >> > > > > > > > very popular algorithm like *word2vec one more time*, when
> >> > there
> >> > > is
> >> > > > > > > already
> >> > > > > > > > available implementation in java provided by
> >> > deeplearning4j.org
> >> > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
> >> Apache
> >> > 2
> >> > > > > > > licence).
> >> > > > > > > > This library tries to promote itself, there is a hype
> >> around it
> >> > > in
> >> > > > ML
> >> > > > > > > > sphere, and it was integrated with Apache Spark, to
> provide
> >> > > > scalable
> >> > > > > > > > deeplearning calculations.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > *That's why I thought: could we integrate with this
> library
> >> or
> >> > > not
> >> > > > > also
> >> > > > > > > and
> >> > > > > > > > Flink? *
> >> > > > > > > >
> >> > > > > > > > 1) Personally I think, providing support and deployment of
> >> > > > > > > > *Deeplearning(DL)
> >> > > > > > > > algorithms/models in Flink* is promising and attractive
> >> > feature,
> >> > > > > > because:
> >> > > > > > > >
> >> > > > > > > >     a) during last two years DL proved its efficiency and
> >> these
> >> > > > > > > algorithms
> >> > > > > > > > are used in many applications. For example *Spotify *uses DL
> >> based
> >> > > > > > algorithms
> >> > > > > > > > for music content extraction: Recommending music on
> Spotify
> >> > with
> >> > > > deep
> >> > > > > > > > learning AUGUST 05, 2014
> >> > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html>
> for
> >> > > their
> >> > > > > > music
> >> > > > > > > > recommendations. Developers need to scale up DL manually,
> >> that
> >> > > > causes
> >> > > > > a
> >> > > > > > > lot
> >> > > > > > > > of work, so that’s why such platforms like Flink should
> >> support
> >> > > > these
> >> > > > > > > > models deployment.
> >> > > > > > > >
> >> > > > > > > >     b) Here is presented the scope of Deeplearning usage
> >> cases
> >> > > > > > > > <https://deeplearning4j.org/use_cases>, so many of this
> >> > > scenarios
> >> > > > > > > related
> >> > > > > > > > to scenarios, that could be supported on Flink.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > 2) But DL uncover such questions like:
> >> > > > > > > >
> >> > > > > > > >     a) scale up calculations over machines
> >> > > > > > > >
> >> > > > > > > >     b) perform these calculations both over CPU and GPU.
> >> GPU is
> >> > > > > > required
> >> > > > > > > to
> >> > > > > > > > train big DL models, otherwise learning process could have
> >> very
> >> > > > slow
> >> > > > > > > > convergence.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > 3) I have checked this DL4J library, which already has
> >> rich
> >> > > > support
> >> > > > > > of
> >> > > > > > > > many attractive DL models like: Recurrent Networks and
> >> LSTMs,
> >> > > > > > > Convolutional
> >> > > > > > > > Networks (CNN), Restricted Boltzmann Machines (RBM) and
> >> others.
> >> > > So
> >> > > > we
> >> > > > > > > won’t
> >> > > > > > > > need to implement them independently, but only provide the
> >> > > ability
> >> > > > of
> >> > > > > > > > execution of this models over Flink cluster, the quite
> >> similar
> >> > > way
> >> > > > > like
> >> > > > > > > it
> >> > > > > > > > was integrated with Apache Spark.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Because of all of this I propose:
> >> > > > > > > >
> >> > > > > > > > 1)    To create new ticket in Flink’s JIRA for integration
> >> of
> >> > > Flink
> >> > > > > > with
> >> > > > > > > > DL4J and decide on which side this integration should be
> >> > > > implemented.
> >> > > > > > > >
> >> > > > > > > > 2)    Support natively GPU resources in Flink and allow
> >> > > > calculations
> >> > > > > > over
> >> > > > > > > > them, as described in this publication
> >> > > > > > > > https://www.oreilly.com/learning/accelerating-spark-
> >> > > > > > workloads-using-gpus
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > *Regarding original issue Implement Word2Vec
> >> > > > > > > > <https://issues.apache.org/jira/browse/FLINK-2094>in
> Flink,
> >> > *I
> >> > > > have
> >> > > > > > > > investigated its implementation in DL4J and  that
> >> > implementation
> >> > > of
> >> > > > > > > > integration DL4J with Apache Spark, and got several
> points:
> >> > > > > > > >
> >> > > > > > > > It seems that idea of building of our own implementation
> of
> >> > > > word2vec
> >> > > > > in
> >> > > > > > > > Flink is not such a bad solution, because: This DL4J was
> >> forced to
> >> > > > > > > reimplement
> >> > > > > > > > its original word2Vec over Spark. I have checked the
> >> > integration
> >> > > of
> >> > > > > > DL4J
> >> > > > > > > > with Spark, and found that it is too strongly coupled with
> >> > Spark
> >> > > > API,
> >> > > > > > so
> >> > > > > > > > that it is impossible just to take some DL4J API and reuse
> >> it,
> >> > > > > instead
> >> > > > > > we
> >> > > > > > > > need to implement independent integration for Flink.
> >> > > > > > > >
> >> > > > > > > > *That’s why we will simply finish the implementation of the current PR
> >> > > > > > > > **independently **from
> >> > > > > > > > integration to DL4J.*
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Could you please provide your opinion regarding my
> questions
> >> > and
> >> > > > > > points,
> >> > > > > > > > what do you think about them?
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > пн, 6 февр. 2017 г. в 12:51, Katherin Eri <
> >> > > katherinmail@gmail.com
> >> > > > >:
> >> > > > > > > >
> >> > > > > > > > > Sorry, guys I need to finish this letter first.
> >> > > > > > > > >   Full version of it will come shortly.
> >> > > > > > > > >
> >> > > > > > > > > пн, 6 февр. 2017 г. в 12:49, Katherin Eri <
> >> > > > katherinmail@gmail.com
> >> > > > > >:
> >> > > > > > > > >
> >> > > > > > > > > Hello, guys.
> >> > > > > > > > > Theodore, last week I started the review of the PR:
> >> > > > > > > > > https://github.com/apache/flink/pull/2735 related to
> >> > *word2Vec
> >> > > > for
> >> > > > > > > > Flink*.
> >> > > > > > > > >
> >> > > > > > > > > During this review I have asked myself: why do we need
> to
> >> > > > implement
> >> > > > > > > such
> >> > > > > > > > a
> >> > > > > > > > > very popular algorithm like *word2vec one more time*,
> when
> >> > > there
> >> > > > is
> >> > > > > > > > > already available implementation in Java provided by
> >> > > > > > deeplearning4j.org
> >> > > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
> >> > Apache
> >> > > 2
> >> > > > > > > > licence).
> >> > > > > > > > > This library tries to promote itself, there is a hype
> >> around
> >> > > it
> >> > > > in
> >> > > > > > ML
> >> > > > > > > > > sphere, and  it was integrated with Apache Spark, to
> >> provide
> >> > > > > scalable
> >> > > > > > > > > deeplearning calculations.
> >> > > > > > > > > That's why I thought: could we integrate with this
> >> library or
> >> > > not
> >> > > > > > also
> >> > > > > > > > and
> >> > > > > > > > > Flink?
> >> > > > > > > > > 1) Personally I think, providing support and deployment
> of
> >> > > > > > Deeplearning
> >> > > > > > > > > algorithms/models in Flink is promising and attractive
> >> > feature,
> >> > > > > > > because:
> >> > > > > > > > >     a) during last two years deeplearning proved its
> >> > efficiency
> >> > > > and
> >> > > > > > > these
> >> > > > > > > > > algorithms are used in many applications. For example
> *Spotify
> >> > > *uses
> >> > > > DL
> >> > > > > > > based
> >> > > > > > > > > algorithms for music content extraction: Recommending
> >> music
> >> > on
> >> > > > > > Spotify
> >> > > > > > > > > with deep learning AUGUST 05, 2014
> >> > > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html>
> >> for
> >> > > > their
> >> > > > > > > music
> >> > > > > > > > > recommendations. Doing this natively scalable is very
> >> > > attractive.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > I have investigated that implementation of integration
> >> DL4J
> >> > > with
> >> > > > > > Apache
> >> > > > > > > > > Spark, and got several points:
> >> > > > > > > > >
> >> > > > > > > > > 1) It seems that idea of building of our own
> >> implementation
> >> > of
> >> > > > > > word2vec
> >> > > > > > > > > not such a bad solution, because the integration of DL4J
> >> with
> >> > > > Spark
> >> > > > > > is
> >> > > > > > > > too
> >> > > > > > > > > strongly coupled with Spark API and it will take time
> from
> >> > the
> >> > > > side
> >> > > > > > of
> >> > > > > > > > DL4J
> >> > > > > > > > > to adopt this integration to Flink. Also I have expected
> >> that
> >> > > we
> >> > > > > will
> >> > > > > > > be
> >> > > > > > > > > able to call just some API, but that is not the case.
> >> > > > > > > > > 2)
> >> > > > > > > > >
> >> > > > > > > > > https://deeplearning4j.org/use_cases
> >> > > > > > > > > https://www.analyticsvidhya.com/blog/2017/01/t-sne-
> >> > > > > > > > implementation-r-python/
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > чт, 19 янв. 2017 г. в 13:29, Till Rohrmann <
> >> > > trohrmann@apache.org
> >> > > > >:
> >> > > > > > > > >
> >> > > > > > > > > Hi Katherin,
> >> > > > > > > > >
> >> > > > > > > > > welcome to the Flink community. Always great to see new
> >> > people
> >> > > > > > joining
> >> > > > > > > > the
> >> > > > > > > > > community :-)
> >> > > > > > > > >
> >> > > > > > > > > Cheers,
> >> > > > > > > > > Till
> >> > > > > > > > >
> >> > > > > > > > > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
> >> > > > > > > > katherinmail@gmail.com>
> >> > > > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > ok, I've got it.
> >> > > > > > > > > > I will take a look at
> >> > > > https://github.com/apache/flink/pull/2735
> >> > > > > .
> >> > > > > > > > > >
> >> > > > > > > > > > вт, 17 янв. 2017 г. в 14:36, Theodore Vasiloudis <
> >> > > > > > > > > > theodoros.vasiloudis@gmail.com>:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hello Katherin,
> >> > > > > > > > > > >
> >> > > > > > > > > > > Welcome to the Flink community!
> >> > > > > > > > > > >
> >> > > > > > > > > > > The ML component definitely needs a lot of work you
> >> are
> >> > > > > correct,
> >> > > > > > we
> >> > > > > > > > are
> >> > > > > > > > > > > facing similar problems to CEP, which we'll
> hopefully
> >> > > resolve
> >> > > > > > with
> >> > > > > > > > the
> >> > > > > > > > > > > restructuring Stephan has mentioned in that thread.
> >> > > > > > > > > > >
> >> > > > > > > > > > > If you'd like to help out with PRs we have many
> open,
> >> > one I
> >> > > > > have
> >> > > > > > > > > started
> >> > > > > > > > > > > reviewing but got side-tracked is the Word2Vec one
> >> [1].
> >> > > > > > > > > > >
> >> > > > > > > > > > > Best,
> >> > > > > > > > > > > Theodore
> >> > > > > > > > > > >
> >> > > > > > > > > > > [1] https://github.com/apache/flink/pull/2735
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <
> >> > > > > > fhueske@gmail.com
> >> > > > > > > >
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Hi Katherin,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > welcome to the Flink community!
> >> > > > > > > > > > > > Help with reviewing PRs is always very welcome
> and a
> >> > > great
> >> > > > > way
> >> > > > > > to
> >> > > > > > > > > > > > contribute.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Best, Fabian
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> >> > > > > > > > katherinmail@gmail.com
> >> > > > > > > > > >:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > Thank you, Timo.
> >> > > > > > > > > > > > > I have started the analysis of the topic.
> >> > > > > > > > > > > > > And if it is necessary, I will try to perform the
> >> review
> >> > > of
> >> > > > > > other
> >> > > > > > > > > pulls)
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > вт, 17 янв. 2017 г. в 13:09, Timo Walther <
> >> > > > > > twalthr@apache.org
> >> > > > > > > >:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Hi Katherin,
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > great to hear that you would like to
> contribute!
> >> > > > Welcome!
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > I gave you contributor permissions. You can
> now
> >> > > assign
> >> > > > > > issues
> >> > > > > > > > to
> >> > > > > > > > > > > > > > yourself. I assigned FLINK-1750 to you.
> >> > > > > > > > > > > > > > Right now there are many open ML pull
> requests,
> >> you
> >> > > are
> >> > > > > > very
> >> > > > > > > > > > welcome
> >> > > > > > > > > > > to
> >> > > > > > > > > > > > > > review the code of others, too.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Timo
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> >> > > > > > > > > > > > > > > Hello, All!
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > I'm Kate Eri, I'm java developer with 6-year
> >> > > > enterprise
> >> > > > > > > > > > experience,
> >> > > > > > > > > > > > > also
> >> > > > > > > > > > > > > > I
> >> > > > > > > > > > > > > > > have some expertise with scala (half of the
> >> > year).
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Last 2 years I have participated in several
> >> > BigData
> >> > > > > > > projects
> >> > > > > > > > > that
> >> > > > > > > > > > > > were
> >> > > > > > > > > > > > > > > related to Machine Learning (Time series
> >> > analysis,
> >> > > > > > > > Recommender
> >> > > > > > > > > > > > systems,
> >> > > > > > > > > > > > > > > Social networking) and ETL. I have
> experience
> >> > with
> >> > > > > > Hadoop,
> >> > > > > > > > > Apache
> >> > > > > > > > > > > > Spark
> >> > > > > > > > > > > > > > and
> >> > > > > > > > > > > > > > > Hive.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > I’m fond of ML topic, and I see that Flink
> >> > project
> >> > > > > > requires
> >> > > > > > > > > some
> >> > > > > > > > > > > work
> >> > > > > > > > > > > > > in
> >> > > > > > > > > > > > > > > this area, so that’s why I would like to
> join
> >> > Flink
> >> > > > and
> >> > > > > > ask
> >> > > > > > > > me
> >> > > > > > > > > to
> >> > > > > > > > > > > > grant
> >> > > > > > > > > > > > > > the
> >> > > > > > > > > > > > > > > assignment of the ticket
> >> > > > > > > > > > > > > > https://issues.apache.org/jira
> >> /browse/FLINK-1750
> >> > > > > > > > > > > > > > > to me.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
>

Re: New Flink team member - Kate Eri.

Posted by Katherin Eri <ka...@gmail.com>.
I have created a ticket to discuss GPU-related questions further:
https://issues.apache.org/jira/browse/FLINK-5782

пт, 10 февр. 2017 г. в 18:16, Katherin Eri <ka...@gmail.com>:

> Thank you, Trevor!
>
> You have shared very valuable points; I will consider them.
>
> So I think I should finally create a ticket in Flink’s JIRA, at least for
> Flink's GPU support, and move the related discussion there?
>
> I will contact Suneel regarding DL4J, thanks!
>
>
> пт, 10 февр. 2017 г. в 17:44, Trevor Grant <tr...@gmail.com>:
>
> Also RE: DL4J integration.
>
> Suneel had done some work on this a while back, and ran into issues.  You
> might want to chat with him about the pitfalls and 'gotchas' there.
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Fri, Feb 10, 2017 at 7:37 AM, Trevor Grant <tr...@gmail.com>
> wrote:
>
> > Sorry for chiming in late.
> >
> > GPUs on Flink.  Till raised a good point- you need to be able to fall
> back
> > to non-GPU resources if they aren't available.
> >
> > Fun fact: this has already been developed for Flink vis-a-vis the Apache
> > Mahout project.
> >
> > In short- Mahout exposes a number of tensor functions (vector %*% matrix,
> > matrix %*% matrix, etc).  If compiled for GPU support, those operations
> are
> > completed via GPU- and if no GPUs are in fact available, Mahout math
> falls
> > back to CPUs (and finally back to the JVM).
> >
> > How this should work is Flink takes care of shipping data around the
> > cluster, and when data arrives at the local node- is dumped out to GPU
> for
> > calculation, loaded back up and shipped back around cluster.  In
> practice,
> > the lack of a persist method for intermediate results makes this
> > troublesome (not because of GPUs, but because for any sort of complex
> > algorithm we expect to be able to cache intermediate results).
> >
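The fallback chain Trevor describes (GPU, then native CPU, then plain JVM math) can be pictured with a small sketch. This is a conceptual Python illustration only, not Mahout's actual code; the probe functions are hypothetical stand-ins for real device detection (e.g. ViennaCL/CUDA checks):

```python
# Conceptual sketch of a capability-based backend fallback chain:
# try the fastest available backend for a matrix multiply, falling
# back GPU -> native CPU -> plain interpreter. Probes are stand-ins.

def probe_gpu():
    """Hypothetical stand-in for a real GPU device probe."""
    return False  # assume no GPU in this sketch

def probe_native():
    """Hypothetical stand-in for detecting a native (OpenMP) solver."""
    return False

def matmul_plain(a, b):
    """Pure-interpreter fallback: naive dense matrix multiply."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def pick_backend():
    # A real build would bind GPU/native kernels here; in this sketch
    # every branch uses the plain implementation.
    if probe_gpu():
        return "gpu", matmul_plain
    if probe_native():
        return "native", matmul_plain
    return "jvm", matmul_plain

backend, matmul = pick_backend()
print(backend, matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# With both probes returning False this prints:
# jvm [[19, 22], [43, 50]]
```

The point is that callers see one operation; which hardware executes it is decided at runtime, exactly so that code keeps working when no GPU is present.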
> > +1 to FLINK-1730
> >
> > Everything in Mahout is modular- distributed engine
> > (Flink/Spark/Write-your-own), Native Solvers (OpenMP / ViennaCL / CUDA /
> > Write-your-own), algorithms, etc.
> >
> > So to sum up, you're noting the redundancy between ML packages in terms
> of
> > algorithms- I would recommend checking out Mahout before rolling your own
> > GPU integration (else risk redundantly integrating GPUs). If nothing
> else-
> > it should give you some valuable insight regarding design considerations.
> > Also FYI the goal of the Apache Mahout project is to address that problem
> > precisely- implement an algorithm once in a mathematically expressive
> DSL,
> > which is abstracted above the engine so the same code easily ports
> between
> > engines / native solvers (i.e. CPU/GPU).
> >
> > https://github.com/apache/mahout/tree/master/viennacl-omp
> > https://github.com/apache/mahout/tree/master/viennacl
> >
> > Best,
> > tg
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Fri, Feb 10, 2017 at 7:01 AM, Katherin Eri <ka...@gmail.com>
> > wrote:
> >
> >> Thank you Felix, for provided information.
> >>
> >> Currently I analyze the provided integration of Flink with SystemML.
> >>
> >> I am also gathering information for the ticket FLINK-1730
> >> <https://issues.apache.org/jira/browse/FLINK-1730>; maybe we will take
> >> it on, to unblock the SystemML/Flink integration.
> >>
> >>
> >>
> >> чт, 9 февр. 2017 г. в 0:17, Felix Neutatz <neutatz@googlemail.com.invali
> >> d>:
> >>
> >> > Hi Kate,
> >> >
> >> > 1) - Broadcast:
> >> >
> >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+
> >> Only+send+data+to+each+taskmanager+once+for+broadcasts
> >> >  - Caching: https://issues.apache.org/jira/browse/FLINK-1730
> >> >
> >> > 2) I have no idea about the GPU implementation. The SystemML mailing
> >> list
> >> > will probably help you out there.
> >> >
> >> > Best regards,
> >> > Felix
> >> >
> >> > 2017-02-08 14:33 GMT+01:00 Katherin Eri <ka...@gmail.com>:
> >> >
> >> > > Thank you Felix, for your point, it is quite interesting.
> >> > >
> >> > > I will take a look at the code, of the provided Flink integration.
> >> > >
> >> > > 1)    You mentioned these problems with Flink: "we realized that
> >> > > the lack of a caching operator and a broadcast issue highly
> >> > > affects the performance". Have you already asked the community
> >> > > about this? If yes, please provide a reference to the ticket or
> >> > > the subject of that thread.
> >> > >
> >> > > 2)    You have said, that SystemML provides GPU support. I have seen
> >> > > SystemML’s source code and would like to ask: why have you decided
> >> > > to implement your own integration with CUDA? Did you consider
> >> > > ND4J, or do you maintain your own implementation because ND4J is
> >> > > younger?
> >> > >
> >> > > вт, 7 февр. 2017 г. в 18:35, Felix Neutatz <neutatz@googlemail.com
> >:
> >> > >
> >> > > > Hi Katherin,
> >> > > >
> >> > > > we are also working in a similar direction. We implemented a
> >> prototype
> >> > to
> >> > > > integrate with SystemML:
> >> > > > https://github.com/apache/incubator-systemml/pull/119
> >> > > > SystemML provides many different matrix formats, operations, GPU
> >> > support
> >> > > > and a couple of DL algorithms. Unfortunately, we realized that the
> >> lack
> >> > > of
> >> > > > a caching operator and a broadcast issue highly affects the
> >> performance
> >> > > > (e.g. compared to Spark). At the moment I am trying to tackle the
> >> > > broadcast
> >> > > > issue. But caching is still a problem for us.
> >> > > >
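The caching problem Felix mentions (FLINK-1730) is easy to see in miniature: without a caching/persist operator, an iterative algorithm re-executes its upstream pipeline on every iteration. A toy Python illustration of the effect, not Flink code; the function names are invented for the sketch:

```python
# Toy illustration of why a caching operator matters for iterative ML:
# without caching, every iteration re-runs the upstream transformation
# (`parse` here), multiplying its cost by the number of iterations.

recomputations = {"parse": 0}

def parse(raw):
    """Stand-in for an expensive upstream transformation."""
    recomputations["parse"] += 1
    return [float(x) for x in raw]

def train(raw, iterations):
    weight = 0.0
    for _ in range(iterations):
        data = parse(raw)          # no cache: recomputed every iteration
        weight += sum(data) / len(data)
    return weight

train(["1", "2", "3"], iterations=5)
print(recomputations["parse"])  # 5 -- with a cache this would be 1
```

A persist/cache operator would materialize `parse`'s result once and reuse it, which is exactly what complex algorithms on intermediate results need.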
> >> > > > Best regards,
> >> > > > Felix
> >> > > >
> >> > > > 2017-02-07 16:22 GMT+01:00 Katherin Eri <ka...@gmail.com>:
> >> > > >
> >> > > > > Thank you, Till.
> >> > > > >
> >> > > > > 1)      Regarding ND4J, I didn’t know about such an unfortunate and
> >> critical
> >> > > > > restriction of it -> lack of sparsity optimizations, and you are
> >> > right:
> >> > > > > this issue is still actual for them. I saw that Flink uses
> Breeze,
> >> > but
> >> > > I
> >> > > > > thought its usage caused by some historical reasons.
> >> > > > >
> >> > > > > 2)      Regarding integration with DL4J, I have read the source
> >> code
> >> > of
> >> > > > > DL4J/Spark integration, that’s why I have declined my idea of
> >> reuse
> >> > of
> >> > > > > their word2vec implementation for now, for example. I can
> perform
> >> > > deeper
> >> > > > > investigation of this topic, if it is required.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > So I feel that we have the following picture:
> >> > > > >
> >> > > > > 1)      DL integration investigation, could be part of Apache
> >> Bahir.
> >> > I
> >> > > > can
> >> > > > perform further investigation of this topic, but I think we need
> >> some
> >> > > > separate ticket to track this activity.
> >> > > > >
> >> > > > > 2)      GPU support, required for DL is interesting, but
> requires
> >> > ND4J
> >> > > > for
> >> > > > > example.
> >> > > > >
> >> > > > > 3)      ND4J couldn’t be incorporated because it doesn’t support
> >> > > sparsity
> >> > > > > <https://deeplearning4j.org/roadmap.html> [1].
> >> > > > >
> >> > > > > Regarding ND4J: is this the only blocker for incorporating it,
> >> > > > > or are there other known blockers?
> >> > > > >
> >> > > > >
> >> > > > > [1] https://deeplearning4j.org/roadmap.html
> >> > > > >
> >> > > > > вт, 7 февр. 2017 г. в 16:26, Till Rohrmann <
> trohrmann@apache.org
> >> >:
> >> > > > >
> >> > > > > Thanks for initiating this discussion Katherin. I think you're
> >> right
> >> > > that
> >> > > > > in general it does not make sense to reinvent the wheel over and
> >> over
> >> > > > > again. Especially if you only have limited resources at hand. So
> >> if
> >> > we
> >> > > > > could integrate Flink with some existing library that would be
> >> great.
> >> > > > >
> >> > > > > In the past, however, we couldn't find a good library which
> >> provided
> >> > > > enough
> >> > > > > freedom to integrate it with Flink. Especially if you want to
> have
> >> > > > > distributed and somewhat high-performance implementations of ML
> >> > > > algorithms
> >> > > > > you would have to take Flink's execution model (capabilities as
> >> well
> >> > as
> >> > > > > limitations) into account. That is mainly the reason why we
> >> started
> >> > > > > implementing some of the algorithms "natively" on Flink.
> >> > > > >
> >> > > > > If I remember correctly, then the problem with ND4J was and
> still
> >> is
> >> > > that
> >> > > > > it does not support sparse matrices which was a requirement from
> >> our
> >> > > > side.
> >> > > > > As far as I know, it is quite common that you have sparse data
> >> > > structures
> >> > > > > when dealing with large scale problems. That's why we built our
> >> own
> >> > > > > abstraction which can have different implementations. Currently,
> >> the
> >> > > > > default implementation uses Breeze.
> >> > > > >
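Till's point about sparsity can be illustrated with a toy sketch: a sparse vector stores only its non-zero entries, so operations touch far fewer elements than the dense equivalent. This is a conceptual Python illustration of the idea only (Breeze's actual types are Scala's `DenseVector`/`SparseVector` and `CSCMatrix`):

```python
# Toy illustration of why sparse support matters at scale: a dot
# product over {index: value} dicts touches only non-zero entries,
# while a dense representation would iterate over every dimension.

def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the smaller vector and look up matches in the other.
    small, big = (u, v) if len(u) <= len(v) else (v, u)
    return sum(val * big.get(idx, 0.0) for idx, val in small.items())

# A "1-million-dimensional" vector with only three non-zeros:
u = {0: 1.0, 500_000: 2.0, 999_999: 3.0}
v = {500_000: 4.0, 999_999: 0.5}

print(sparse_dot(u, v))  # 2.0*4.0 + 3.0*0.5 = 9.5
```

With large-scale text or graph features, almost all dimensions are zero, which is why a dense-only library like ND4J (at the time) was a poor fit.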
> >> > > > > I think the support for GPU based operations and the actual
> >> resource
> >> > > > > management are two orthogonal things. The implementation would
> >> have
> >> > to
> >> > > > work
> >> > > > > with no GPUs available anyway. If the system detects that GPUs
> are
> >> > > > > available, then ideally it would exploit them. Thus, we could
> add
> >> > this
> >> > > > > feature later and maybe integrate it with FLINK-5131 [1].
> >> > > > >
> >> > > > > Concerning the integration with DL4J I think that Theo's
> proposal
> >> to
> >> > do
> >> > > > it
> >> > > > > in a separate repository (maybe as part of Apache Bahir) is a
> good
> >> > > idea.
> >> > > > > We're currently thinking about outsourcing some of Flink's
> >> libraries
> >> > > into
> >> > > > > sub projects. This could also be an option for the DL4J
> >> integration
> >> > > then.
> >> > > > > In general I think it should be feasible to run DL4J on Flink
> >> given
> >> > > that
> >> > > > it
> >> > > > > also runs on Spark. Have you already looked at it closer?
> >> > > > >
> >> > > > > [1] https://issues.apache.org/jira/browse/FLINK-5131
> >> > > > >
> >> > > > > Cheers,
> >> > > > > Till
> >> > > > >
> >> > > > > On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri <
> >> > katherinmail@gmail.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Thank you Theodore, for your reply.
> >> > > > > >
> >> > > > > > 1)    Regarding GPU, your point is clear and I agree with it,
> >> ND4J
> >> > > > looks
> >> > > > > > appropriate. But, my current understanding is that, we also
> >> need to
> >> > > > cover
> >> > > > > > some resource management questions -> when we need to provide
> >> GPU
> >> > > > support
> >> > > > > > we also need to manage it as a resource. For example, Mesos
> has
> >> > > already
> >> > > > > > supported GPU like resource item: Initial support for GPU
> >> > resources.
> >> > > > > > <
> >> > https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU
> >> > > >
> >> > > > > > Flink
> >> > > > > > uses Mesos as cluster manager, and this means that this
> feature
> >> of
> >> > > > Mesos
> >> > > > > > could be reused. Also, memory management questions in Flink
> >> regarding
> >> > > GPU
> >> > > > > > should be clarified.
> >> > > > > >
> >> > > > > > 2)    Regarding integration with DL4J: what stops us from
> >> > > > opening a
> >> > > > ticket
> >> > > > > > and start the discussion around this topic? We need some user
> >> story
> >> > > or
> >> > > > > the
> >> > > > > > community is not sure that DL is really helpful? Why did the
> >> > > > > > discussion with Adam Gibson end with no implementation of any
> >> > > > > > idea? What
> >> > > concerns
> >> > > > do
> >> > > > > > we have?
> >> > > > > >
> >> > > > > > пн, 6 февр. 2017 г. в 15:01, Theodore Vasiloudis <
> >> > > > > > theodoros.vasiloudis@gmail.com>:
> >> > > > > >
> >> > > > > > > Hello all,
> >> > > > > > >
> >> > > > > > > This is point that has come up in the past: Given the
> >> multitude
> >> > of
> >> > > ML
> >> > > > > > > libraries out there, should we have native implementations
> in
> >> > > FlinkML
> >> > > > > or
> >> > > > > > > try to integrate other libraries instead?
> >> > > > > > >
> >> > > > > > > We haven't managed to reach a consensus on this before. My
> >> > opinion
> >> > > is
> >> > > > > > that
> >> > > > > > > there is definitely value in having ML algorithms written
> >> > natively
> >> > > in
> >> > > > > > > Flink, both for performance optimization,
> >> > > > > > > but more importantly for engineering simplicity, we don't
> >> want to
> >> > > > force
> >> > > > > > > users to use yet another piece of software to run their ML
> >> algos
> >> > > (at
> >> > > > > > least
> >> > > > > > > for a basic set of algorithms).
> >> > > > > > >
> >> > > > > > > We have in the past  discussed integrations with DL4J
> >> > (particularly
> >> > > > > ND4J)
> >> > > > > > > with Adam Gibson, the core developer of the library, but we
> >> never
> >> > > got
> >> > > > > > > around to implementing anything.
> >> > > > > > >
> >> > > > > > > Whether it makes sense to have an integration with DL4J as
> >> part
> >> > of
> >> > > > the
> >> > > > > > > Flink distribution would be up for discussion. I would
> >> suggest to
> >> > > > make
> >> > > > > it
> >> > > > > > > an independent repo to allow for
> >> > > > > > > faster dev/release cycles, and because it wouldn't be
> directly
> >> > > > related
> >> > > > > to
> >> > > > > > > the core of Flink so it would add extra reviewing burden to
> an
> >> > > > already
> >> > > > > > > overloaded group of committers.
> >> > > > > > >
> >> > > > > > > Natively supporting GPU calculations in Flink would be much
> >> > better
> >> > > > > > achieved
> >> > > > > > > through a library like ND4J, the engineering burden would be
> >> too
> >> > > much
> >> > > > > > > otherwise.
> >> > > > > > >
> >> > > > > > > Regards,
> >> > > > > > > Theodore
> >> > > > > > >
> >> > > > > > > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <
> >> > > > katherinmail@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hello, guys.
> >> > > > > > > >
> >> > > > > > > > Theodore, last week I started the review of the PR:
> >> > > > > > > > https://github.com/apache/flink/pull/2735 related to
> >> *word2Vec
> >> > > for
> >> > > > > > > Flink*.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > During this review I have asked myself: why do we need to
> >> > > implement
> >> > > > > > such
> >> > > > > > > a
> >> > > > > > > > very popular algorithm like *word2vec one more time*, when
> >> > there
> >> > > is
> >> > > > > > > already
> >> > > > > > > > available implementation in java provided by
> >> > deeplearning4j.org
> >> > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
> >> Apache
> >> > 2
> >> > > > > > > licence).
> >> > > > > > > > This library tries to promote itself, there is a hype
> >> around it
> >> > > in
> >> > > > ML
> >> > > > > > > > sphere, and it was integrated with Apache Spark, to
> provide
> >> > > > scalable
> >> > > > > > > > deeplearning calculations.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > *That's why I thought: could we integrate with this
> library
> >> or
> >> > > not
> >> > > > > also
> >> > > > > > > and
> >> > > > > > > > Flink? *
> >> > > > > > > >
> >> > > > > > > > 1) Personally I think, providing support and deployment of
> >> > > > > > > > *Deeplearning(DL)
> >> > > > > > > > algorithms/models in Flink* is a promising and attractive
> >> > feature,
> >> > > > > > because:
> >> > > > > > > >
> >> > > > > > > >     a) during last two years DL proved its efficiency and
> >> these
> >> > > > > > > algorithms
> >> > > > > > > > are used in many applications. For example *Spotify *uses DL
> >> based
> >> > > > > > algorithms
> >> > > > > > > > for music content extraction: Recommending music on
> Spotify
> >> > with
> >> > > > deep
> >> > > > > > > > learning AUGUST 05, 2014
> >> > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html>
> for
> >> > > their
> >> > > > > > music
> >> > > > > > > > recommendations. Developers need to scale up DL manually,
> >> > > > > > > > which causes a lot of work; that’s why platforms like
> >> > > > > > > > Flink should support deployment of these models.
> >> > > > > > > >
> >> > > > > > > >     b) Here is an overview of deep learning use cases
> >> > > > > > > > <https://deeplearning4j.org/use_cases>; many of these
> >> > > > > > > > scenarios could be supported on Flink.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > 2) But DL raises questions such as:
> >> > > > > > > >
> >> > > > > > > >     a) scaling calculations out over multiple machines
> >> > > > > > > >
> >> > > > > > > >     b) performing these calculations on both CPU and GPU.
> >> > > > > > > > GPUs are required to train big DL models; otherwise the
> >> > > > > > > > learning process can converge very slowly.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > 3) I have checked this DL4J library, which already has
> >> rich
> >> > > > support
> >> > > > > > for
> >> > > > > > > > many attractive DL models like: Recurrent Networks and
> >> LSTMs,
> >> > > > > > > Convolutional
> >> > > > > > > > Networks (CNN), Restricted Boltzmann Machines (RBM) and
> >> > > > > > > > others. So we won’t need to implement them independently,
> >> > > > > > > > but only provide the ability to execute these models over
> >> > > > > > > > a Flink cluster, much like the Apache Spark integration.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Because of all of this I propose:
> >> > > > > > > >
> >> > > > > > > > 1)    To create a new ticket in Flink’s JIRA for integration
> >> of
> >> > > Flink
> >> > > > > > with
> >> > > > > > > > DL4J and decide on which side this integration should be
> >> > > > implemented.
> >> > > > > > > >
> >> > > > > > > > 2)    Natively support GPU resources in Flink and allow
> >> > > > calculations
> >> > > > > > over
> >> > > > > > > > them, as described in this publication
> >> > > > > > > > https://www.oreilly.com/learning/accelerating-spark-
> >> > > > > > workloads-using-gpus
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > *Regarding original issue Implement Word2Vec
> >> > > > > > > > <https://issues.apache.org/jira/browse/FLINK-2094>in
> Flink,
> >> > *I
> >> > > > have
> >> > > > > > > > investigated its implementation in DL4J and  that
> >> > implementation
> >> > > of
> >> > > > > > > > integration DL4J with Apache Spark, and got several
> points:
> >> > > > > > > >
> >> > > > > > > > It seems that the idea of building our own implementation
> >> > > > > > > > of word2vec in Flink is not such a bad solution, because
> >> > > > > > > > DL4J itself was forced to reimplement its original
> >> > > > > > > > word2vec on Spark. I have checked the integration of DL4J
> >> > > > > > > > with Spark and found that it is too strongly coupled with
> >> > > > > > > > the Spark API, so it is impossible to just take some DL4J
> >> > > > > > > > API and reuse it; instead we would need to implement an
> >> > > > > > > > independent integration for Flink.
> >> > > > > > > >
> >> > > > > > > > *That’s why we should simply finish the implementation of
> >> > > > > > > > the current PR independently from any DL4J integration.*
> >> > > > > > > >
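For context on the word2vec PR under review: the core of word2vec's training-data preparation is generating (target, context) pairs from a sliding window over each sentence. A minimal Python sketch of that step, not the actual code of the Flink PR:

```python
# Minimal sketch of word2vec's skip-gram pair generation: for each
# position in a sentence, emit (target, context) pairs for every
# neighbor within `window` words on either side.

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target word itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["flink", "streams", "data"]
print(skipgram_pairs(sentence, window=1))
# [('flink', 'streams'), ('streams', 'flink'), ('streams', 'data'),
#  ('data', 'streams')]
```

These pairs then feed a shallow network trained with negative sampling or hierarchical softmax; the distributed part of the problem is sharing and updating the embedding vectors across workers, which is where engine-specific coupling (as in the DL4J/Spark integration) comes in.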
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Could you please provide your opinion regarding my
> questions
> >> > and
> >> > > > > > points,
> >> > > > > > > > what do you think about them?
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > пн, 6 февр. 2017 г. в 12:51, Katherin Eri <
> >> > > katherinmail@gmail.com
> >> > > > >:
> >> > > > > > > >
> >> > > > > > > > > Sorry, guys I need to finish this letter first.
> >> > > > > > > > >   Full version of it will come shortly.
> >> > > > > > > > >
> >> > > > > > > > > пн, 6 февр. 2017 г. в 12:49, Katherin Eri <
> >> > > > katherinmail@gmail.com
> >> > > > > >:
> >> > > > > > > > >
> >> > > > > > > > > Hello, guys.
> >> > > > > > > > > Theodore, last week I started the review of the PR:
> >> > > > > > > > > https://github.com/apache/flink/pull/2735 related to
> >> > *word2Vec
> >> > > > for
> >> > > > > > > > Flink*.
> >> > > > > > > > >
> >> > > > > > > > > During this review I have asked myself: why do we need
> to
> >> > > > implement
> >> > > > > > > such
> >> > > > > > > > a
> >> > > > > > > > > very popular algorithm like *word2vec one more time*,
> when
> >> > > there
> >> > > > is
> >> > > > > > > > > already available implementation in Java provided by
> >> > > > > > deeplearning4j.org
> >> > > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
> >> > Apache
> >> > > 2
> >> > > > > > > > licence).
> >> > > > > > > > > This library tries to promote itself, there is a hype
> >> around
> >> > > it
> >> > > > in
> >> > > > > > ML
> >> > > > > > > > > sphere, and  it was integrated with Apache Spark, to
> >> provide
> >> > > > > scalable
> >> > > > > > > > > deeplearning calculations.
> >> > > > > > > > > That's why I thought: could we integrate with this
> >> library or
> >> > > not
> >> > > > > > also
> >> > > > > > > > and
> >> > > > > > > > > Flink?
> >> > > > > > > > > 1) Personally I think, providing support and deployment
> of
> >> > > > > > Deeplearning
> >> > > > > > > > > algorithms/models in Flink is promising and attractive
> >> > feature,
> >> > > > > > > because:
> >> > > > > > > > >     a) during last two years deeplearning proved its
> >> > efficiency
> >> > > > and
> >> > > > > > > these
> >> > > > > > > > > algorithms are used in many applications. For example
> *Spotify
> >> > > *uses
> >> > > > DL
> >> > > > > > > based
> >> > > > > > > > > algorithms for music content extraction: Recommending
> >> music
> >> > on
> >> > > > > > Spotify
> >> > > > > > > > > with deep learning AUGUST 05, 2014
> >> > > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html>
> >> for
> >> > > > their
> >> > > > > > > music
> >> > > > > > > > > recommendations. Doing this in a natively scalable way is very
> >> > > attractive.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > I have investigated that implementation of integration
> >> DL4J
> >> > > with
> >> > > > > > Apache
> >> > > > > > > > > Spark, and got several points:
> >> > > > > > > > >
> >> > > > > > > > > 1) It seems that the idea of building our own
> >> implementation
> >> > of
> >> > > > > > word2vec
> >> > > > > > > > > is not such a bad solution, because the integration of DL4J
> >> with
> >> > > > Spark
> >> > > > > > is
> >> > > > > > > > too
> >> > > > > > > > > strongly coupled with Spark API and it will take time
> from
> >> > the
> >> > > > side
> >> > > > > > of
> >> > > > > > > > DL4J
> >> > > > > > > > > to adopt this integration to Flink. Also I have expected
> >> that
> >> > > we
> >> > > > > will
> >> > > > > > > be
> >> > > > > > > > > able to just call some API, but that is not the case.
> >> > > > > > > > > 2)
> >> > > > > > > > >
> >> > > > > > > > > https://deeplearning4j.org/use_cases
> >> > > > > > > > > https://www.analyticsvidhya.com/blog/2017/01/t-sne-
> >> > > > > > > > implementation-r-python/
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > чт, 19 янв. 2017 г. в 13:29, Till Rohrmann <
> >> > > trohrmann@apache.org
> >> > > > >:
> >> > > > > > > > >
> >> > > > > > > > > Hi Katherin,
> >> > > > > > > > >
> >> > > > > > > > > welcome to the Flink community. Always great to see new
> >> > people
> >> > > > > > joining
> >> > > > > > > > the
> >> > > > > > > > > community :-)
> >> > > > > > > > >
> >> > > > > > > > > Cheers,
> >> > > > > > > > > Till
> >> > > > > > > > >
> >> > > > > > > > > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
> >> > > > > > > > katherinmail@gmail.com>
> >> > > > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > ok, I've got it.
> >> > > > > > > > > > I will take a look at
> >> > > > https://github.com/apache/flink/pull/2735
> >> > > > > .
> >> > > > > > > > > >
> >> > > > > > > > > > вт, 17 янв. 2017 г. в 14:36, Theodore Vasiloudis <
> >> > > > > > > > > > theodoros.vasiloudis@gmail.com>:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hello Katherin,
> >> > > > > > > > > > >
> >> > > > > > > > > > > Welcome to the Flink community!
> >> > > > > > > > > > >
> >> > > > > > > > > > > The ML component definitely needs a lot of work you
> >> are
> >> > > > > correct,
> >> > > > > > we
> >> > > > > > > > are
> >> > > > > > > > > > > facing similar problems to CEP, which we'll
> hopefully
> >> > > resolve
> >> > > > > > with
> >> > > > > > > > the
> >> > > > > > > > > > > restructuring Stephan has mentioned in that thread.
> >> > > > > > > > > > >
> >> > > > > > > > > > > If you'd like to help out with PRs we have many
> open,
> >> > one I
> >> > > > > have
> >> > > > > > > > > started
> >> > > > > > > > > > > reviewing but got side-tracked is the Word2Vec one
> >> [1].
> >> > > > > > > > > > >
> >> > > > > > > > > > > Best,
> >> > > > > > > > > > > Theodore
> >> > > > > > > > > > >
> >> > > > > > > > > > > [1] https://github.com/apache/flink/pull/2735
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <
> >> > > > > > fhueske@gmail.com
> >> > > > > > > >
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Hi Katherin,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > welcome to the Flink community!
> >> > > > > > > > > > > > Help with reviewing PRs is always very welcome
> and a
> >> > > great
> >> > > > > way
> >> > > > > > to
> >> > > > > > > > > > > > contribute.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Best, Fabian
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> >> > > > > > > > katherinmail@gmail.com
> >> > > > > > > > > >:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > Thank you, Timo.
> >> > > > > > > > > > > > > I have started the analysis of the topic.
> >> > > > > > > > > > > And if necessary, I will try to perform the
> >> review
> >> > > of
> >> > > > > > other
> >> > > > > > > > > pulls)
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > вт, 17 янв. 2017 г. в 13:09, Timo Walther <
> >> > > > > > twalthr@apache.org
> >> > > > > > > >:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Hi Katherin,
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > great to hear that you would like to
> contribute!
> >> > > > Welcome!
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > I gave you contributor permissions. You can
> now
> >> > > assign
> >> > > > > > issues
> >> > > > > > > > to
> >> > > > > > > > > > > > > > yourself. I assigned FLINK-1750 to you.
> >> > > > > > > > > > > > > > Right now there are many open ML pull
> requests,
> >> you
> >> > > are
> >> > > > > > very
> >> > > > > > > > > > welcome
> >> > > > > > > > > > > to
> >> > > > > > > > > > > > > > review the code of others, too.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Timo
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> >> > > > > > > > > > > > > > > Hello, All!
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > I'm Kate Eri, I'm java developer with 6-year
> >> > > > enterprise
> >> > > > > > > > > > experience,
> >> > > > > > > > > > > > > also
> >> > > > > > > > > > > > > > I
> >> > > > > > > > > > > > > > > have some expertise with scala (half of the
> >> > year).
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Last 2 years I have participated in several
> >> > BigData
> >> > > > > > > projects
> >> > > > > > > > > that
> >> > > > > > > > > > > > were
> >> > > > > > > > > > > > > > > related to Machine Learning (Time series
> >> > analysis,
> >> > > > > > > > Recommender
> >> > > > > > > > > > > > systems,
> >> > > > > > > > > > > > > > > Social networking) and ETL. I have
> experience
> >> > with
> >> > > > > > Hadoop,
> >> > > > > > > > > Apache
> >> > > > > > > > > > > > Spark
> >> > > > > > > > > > > > > > and
> >> > > > > > > > > > > > > > > Hive.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > I’m fond of ML topic, and I see that Flink
> >> > project
> >> > > > > > requires
> >> > > > > > > > > some
> >> > > > > > > > > > > work
> >> > > > > > > > > > > > > in
> >> > > > > > > > > > > > > > > this area, so that’s why I would like to
> join
> >> > Flink
> >> > > > and
> >> > > > > > ask
> >> > > > > > > > me
> >> > > > > > > > > to
> >> > > > > > > > > > > > grant
> >> > > > > > > > > > > > > > the
> >> > > > > > > > > > > > > > > assignment of the ticket
> >> > > > > > > > > > > > > > https://issues.apache.org/jira
> >> /browse/FLINK-1750
> >> > > > > > > > > > > > > > > to me.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
>

Re: New Flink team member - Kate Eri.

Posted by Katherin Eri <ka...@gmail.com>.
Thank you, Trevor!

You have shared very valuable points; I will consider them.

So I think I should finally create a ticket in Flink’s JIRA, at least for
Flink's GPU support, and move the related discussion there?

I will contact Suneel regarding DL4J, thanks!


пт, 10 февр. 2017 г. в 17:44, Trevor Grant <tr...@gmail.com>:

> Also RE: DL4J integration.
>
> Suneel had done some work on this a while back, and ran into issues.  You
> might want to chat with him about the pitfalls and 'gotchyas' there.
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Fri, Feb 10, 2017 at 7:37 AM, Trevor Grant <tr...@gmail.com>
> wrote:
>
> > Sorry for chiming in late.
> >
> > GPUs on Flink.  Till raised a good point- you need to be able to fall
> back
> > to non-GPU resources if they aren't available.
> >
> > Fun fact: this has already been developed for Flink vis-a-vis the Apache
> > Mahout project.
> >
> > In short- Mahout exposes a number of tensor functions (vector %*% matrix,
> > matrix %*% matrix, etc).  If compiled for GPU support, those operations
> are
> > completed via GPU- and if no GPUs are in fact available, Mahout math
> falls
> > back to CPUs (and finally back to the JVM).
> >
> > How this should work: Flink takes care of shipping data around the
> > cluster; when data arrives at a local node, it is dumped out to the GPU
> > for calculation, then loaded back up and shipped around the cluster. In
> > practice, the lack of a persist method for intermediate results makes
> > this troublesome (not because of GPUs, but because any sort of complex
> > algorithm expects to be able to cache intermediate results).
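If I understand the fallback chain described above correctly, it amounts to probing each backend and taking the first one that is actually available. A minimal sketch in plain Java (illustrative names only, not Mahout's actual solver API):

```java
import java.util.List;

// Sketch of a native-solver fallback chain: try GPU first, then fall
// back to a plain JVM implementation. All names are illustrative.
public class SolverChain {
    interface Solver {
        boolean isAvailable();                     // e.g. probe for a CUDA device
        double[] times(double[][] m, double[] v);  // matrix %*% vector
    }

    static class JvmSolver implements Solver {
        public boolean isAvailable() { return true; }  // always works
        public double[] times(double[][] m, double[] v) {
            double[] r = new double[m.length];
            for (int i = 0; i < m.length; i++)
                for (int j = 0; j < v.length; j++)
                    r[i] += m[i][j] * v[j];
            return r;
        }
    }

    // Pick the first solver whose backend is actually present.
    static Solver pick(List<Solver> preferred) {
        return preferred.stream()
                .filter(Solver::isAvailable)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no solver"));
    }

    public static void main(String[] args) {
        // A GPU solver whose availability probe fails on this machine:
        Solver gpu = new Solver() {
            public boolean isAvailable() { return false; }
            public double[] times(double[][] m, double[] v) {
                throw new UnsupportedOperationException("no GPU");
            }
        };
        Solver s = pick(List.of(gpu, new JvmSolver()));
        double[] r = s.times(new double[][]{{1, 2}, {3, 4}}, new double[]{1, 1});
        System.out.println(r[0] + " " + r[1]);  // computed by the JVM fallback
    }
}
```

The real probing in Mahout lives in the ViennaCL/OpenMP modules linked below, but the dispatch idea is the same.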
> >
> > +1 to FLINK-1730
> >
> > Everything in Mahout is modular- distributed engine
> > (Flink/Spark/Write-your-own), Native Solvers (OpenMP / ViennaCL / CUDA /
> > Write-your-own), algorithms, etc.
> >
> > So to sum up, you're noting the redundancy between ML packages in terms
> of
> > algorithms- I would recommend checking out Mahout before rolling your own
> > GPU integration (else risk redundantly integrating GPUs). If nothing
> else-
> > it should give you some valuable insight regarding design considerations.
> > Also FYI the goal of the Apache Mahout project is to address that problem
> > precisely- implement an algorithm once in a mathematically expressive
> DSL,
> > which is abstracted above the engine so the same code easily ports
> between
> > engines / native solvers (i.e. CPU/GPU).
> >
> > https://github.com/apache/mahout/tree/master/viennacl-omp
> > https://github.com/apache/mahout/tree/master/viennacl
> >
> > Best,
> > tg
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Fri, Feb 10, 2017 at 7:01 AM, Katherin Eri <ka...@gmail.com>
> > wrote:
> >
> >> Thank you Felix, for provided information.
> >>
> >> Currently I am analyzing the provided integration of Flink with SystemML.
> >>
> >> I am also gathering information for the ticket FLINK-1730
> >> <https://issues.apache.org/jira/browse/FLINK-1730>; maybe we will take
> >> it on, to unblock the SystemML/Flink integration.
> >>
> >>
> >>
> >> чт, 9 февр. 2017 г. в 0:17, Felix Neutatz <neutatz@googlemail.com.invali
> >> d>:
> >>
> >> > Hi Kate,
> >> >
> >> > 1) - Broadcast:
> >> >
> >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+
> >> Only+send+data+to+each+taskmanager+once+for+broadcasts
> >> >  - Caching: https://issues.apache.org/jira/browse/FLINK-1730
> >> >
> >> > 2) I have no idea about the GPU implementation. The SystemML mailing
> >> > list will probably help you out there.
> >> >
> >> > Best regards,
> >> > Felix
> >> >
> >> > 2017-02-08 14:33 GMT+01:00 Katherin Eri <ka...@gmail.com>:
> >> >
> >> > > Thank you Felix, for your point, it is quite interesting.
> >> > >
> >> > > I will take a look at the code, of the provided Flink integration.
> >> > >
> >> > > 1)    You have these problems with Flink: >>we realized that the
> >> > > lack of a caching operator and a broadcast issue highly affects the
> >> > > performance. Have you already asked the community about this? If yes,
> >> > > please provide a reference to the ticket or the subject of the mail.
> >> > >
> >> > > 2)    You have said that SystemML provides GPU support. I have looked
> >> > > at SystemML’s source code and would like to ask: why have you decided
> >> > > to implement your own CUDA integration? Did you consider ND4J, or do
> >> > > you maintain your own implementation because ND4J is younger?
> >> > >
> >> > > вт, 7 февр. 2017 г. в 18:35, Felix Neutatz <neutatz@googlemail.com
> >:
> >> > >
> >> > > > Hi Katherin,
> >> > > >
> >> > > > we are also working in a similar direction. We implemented a
> >> prototype
> >> > to
> >> > > > integrate with SystemML:
> >> > > > https://github.com/apache/incubator-systemml/pull/119
> >> > > > SystemML provides many different matrix formats, operations, GPU
> >> > support
> >> > > > and a couple of DL algorithms. Unfortunately, we realized that the
> >> lack
> >> > > of
> >> > > > a caching operator and a broadcast issue highly affects the
> >> performance
> >> > > > (e.g. compared to Spark). At the moment I am trying to tackle the
> >> > > broadcast
> >> > > > issue. But caching is still a problem for us.
> >> > > >
> >> > > > Best regards,
> >> > > > Felix
> >> > > >
> >> > > > 2017-02-07 16:22 GMT+01:00 Katherin Eri <ka...@gmail.com>:
> >> > > >
> >> > > > > Thank you, Till.
> >> > > > >
> >> > > > > 1)      Regarding ND4J, I didn’t know about such an unfortunate
> >> > > > > and critical restriction of it -> the lack of sparsity
> >> > > > > optimizations, and you are right: this issue is still open for
> >> > > > > them. I saw that Flink uses Breeze, but I thought its usage was
> >> > > > > due to historical reasons.
> >> > > > >
> >> > > > > 2)      Regarding integration with DL4J, I have read the source
> >> > > > > code of the DL4J/Spark integration, which is why I have abandoned
> >> > > > > my idea of reusing their word2vec implementation for now. I can
> >> > > > > investigate this topic more deeply, if required.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > So I feel that we have the following picture:
> >> > > > >
> >> > > > > 1)      DL integration investigation could be part of Apache
> >> > > > > Bahir. I can investigate this topic further, but I think we need
> >> > > > > a separate ticket to track this activity.
> >> > > > >
> >> > > > > 2)      GPU support, required for DL, is interesting, but
> >> > > > > requires ND4J, for example.
> >> > > > >
> >> > > > > 3)      ND4J couldn’t be incorporated because it doesn’t support
> >> > > sparsity
> >> > > > > <https://deeplearning4j.org/roadmap.html> [1].
> >> > > > >
> >> > > > > Regarding ND4J: is this the only blocker for incorporating it,
> >> > > > > or are there other known ones?
> >> > > > >
> >> > > > >
> >> > > > > [1] https://deeplearning4j.org/roadmap.html
> >> > > > >
> >> > > > > вт, 7 февр. 2017 г. в 16:26, Till Rohrmann <
> trohrmann@apache.org
> >> >:
> >> > > > >
> >> > > > > Thanks for initiating this discussion Katherin. I think you're
> >> right
> >> > > that
> >> > > > > in general it does not make sense to reinvent the wheel over and
> >> over
> >> > > > > again. Especially if you only have limited resources at hand. So
> >> if
> >> > we
> >> > > > > could integrate Flink with some existing library that would be
> >> great.
> >> > > > >
> >> > > > > In the past, however, we couldn't find a good library which
> >> provided
> >> > > > enough
> >> > > > > freedom to integrate it with Flink. Especially if you want to
> have
> >> > > > > distributed and somewhat high-performance implementations of ML
> >> > > > algorithms
> >> > > > > you would have to take Flink's execution model (capabilities as
> >> well
> >> > as
> >> > > > > limitations) into account. That is mainly the reason why we
> >> started
> >> > > > > implementing some of the algorithms "natively" on Flink.
> >> > > > >
> >> > > > > If I remember correctly, then the problem with ND4J was and
> still
> >> is
> >> > > that
> >> > > > > it does not support sparse matrices which was a requirement from
> >> our
> >> > > > side.
> >> > > > > As far as I know, it is quite common that you have sparse data
> >> > > structures
> >> > > > > when dealing with large scale problems. That's why we built our
> >> own
> >> > > > > abstraction which can have different implementations. Currently,
> >> the
> >> > > > > default implementation uses Breeze.
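To make the sparsity point concrete: a toy sparse vector (illustrative types only, not Flink's or Breeze's actual API) stores just the non-zero entries, so a dot product costs O(non-zeros) instead of O(dimension):

```java
import java.util.Map;
import java.util.TreeMap;

// Toy sparse vector: index -> value for non-zero entries only.
// Illustrative only; Flink's real math abstraction is backed by Breeze.
public class SparseDemo {
    static class SparseVector {
        final int size;
        final Map<Integer, Double> entries = new TreeMap<>();
        SparseVector(int size) { this.size = size; }
        void set(int i, double v) { if (v != 0.0) entries.put(i, v); }
        // Dot product touches only stored entries: O(nnz), not O(size).
        double dot(SparseVector other) {
            double sum = 0.0;
            for (Map.Entry<Integer, Double> e : entries.entrySet()) {
                Double ov = other.entries.get(e.getKey());
                if (ov != null) sum += e.getValue() * ov;
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        // Vocabulary of a million terms, but each "document" has few terms.
        SparseVector a = new SparseVector(1_000_000);
        a.set(7, 1.0); a.set(42, 2.0); a.set(999_999, 3.0);
        SparseVector b = new SparseVector(1_000_000);
        b.set(42, 5.0); b.set(100, 4.0);
        System.out.println(a.dot(b));   // only index 42 overlaps: 2.0 * 5.0
        System.out.println(a.entries.size() + " stored of " + a.size);
    }
}
```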
> >> > > > >
> >> > > > > I think the support for GPU based operations and the actual
> >> resource
> >> > > > > management are two orthogonal things. The implementation would
> >> have
> >> > to
> >> > > > work
> >> > > > > with no GPUs available anyway. If the system detects that GPUs
> are
> >> > > > > available, then ideally it would exploit them. Thus, we could
> add
> >> > this
> >> > > > > feature later and maybe integrate it with FLINK-5131 [1].
> >> > > > >
> >> > > > > Concerning the integration with DL4J I think that Theo's
> proposal
> >> to
> >> > do
> >> > > > it
> >> > > > > in a separate repository (maybe as part of Apache Bahir) is a
> good
> >> > > idea.
> >> > > > > We're currently thinking about outsourcing some of Flink's
> >> libraries
> >> > > into
> >> > > > > sub projects. This could also be an option for the DL4J
> >> integration
> >> > > then.
> >> > > > > In general I think it should be feasible to run DL4J on Flink
> >> given
> >> > > that
> >> > > > it
> >> > > > > also runs on Spark. Have you already looked at it closer?
> >> > > > >
> >> > > > > [1] https://issues.apache.org/jira/browse/FLINK-5131
> >> > > > >
> >> > > > > Cheers,
> >> > > > > Till
> >> > > > >
> >> > > > > On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri <
> >> > katherinmail@gmail.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Thank you Theodore, for your reply.
> >> > > > > >
> >> > > > > > 1)    Regarding GPU, your point is clear and I agree with it,
> >> ND4J
> >> > > > looks
> >> > > > > > appropriate. But my current understanding is that we also need
> >> > > > > > to cover some resource management questions -> when we provide
> >> > > > > > GPU support, we also need to manage it as a resource. For
> >> > > > > > example, Mesos already supports GPUs as a resource type:
> >> > > > > > Initial support for GPU resources
> >> > > > > > <https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU>
> >> > > > > > Flink
> >> > > > > > uses Mesos as cluster manager, and this means that this
> feature
> >> of
> >> > > > Mesos
> >> > > > > > could be reused. Also, memory management questions in Flink
> >> > > > > > regarding GPUs should be clarified.
> >> > > > > >
> >> > > > > > 2)    Regarding integration with DL4J: what stops us from
> >> > > > > > creating a ticket and starting the discussion around this
> >> > > > > > topic? Do we need some user story, or is the community not sure
> >> > > > > > that DL is really helpful? Why did the discussion with Adam
> >> > > > > > Gibson end with no implementation of any idea? What concerns do
> >> > > > > > we have?
> >> > > > > >
> >> > > > > > пн, 6 февр. 2017 г. в 15:01, Theodore Vasiloudis <
> >> > > > > > theodoros.vasiloudis@gmail.com>:
> >> > > > > >
> >> > > > > > > Hello all,
> >> > > > > > >
> >> > > > > > > This is point that has come up in the past: Given the
> >> multitude
> >> > of
> >> > > ML
> >> > > > > > > libraries out there, should we have native implementations
> in
> >> > > FlinkML
> >> > > > > or
> >> > > > > > > try to integrate other libraries instead?
> >> > > > > > >
> >> > > > > > > We haven't managed to reach a consensus on this before. My
> >> > opinion
> >> > > is
> >> > > > > > that
> >> > > > > > > there is definitely value in having ML algorithms written
> >> > natively
> >> > > in
> >> > > > > > > Flink, both for performance optimization,
> >> > > > > > > but more importantly for engineering simplicity, we don't
> >> want to
> >> > > > force
> >> > > > > > > users to use yet another piece of software to run their ML
> >> algos
> >> > > (at
> >> > > > > > least
> >> > > > > > > for a basic set of algorithms).
> >> > > > > > >
> >> > > > > > > We have in the past  discussed integrations with DL4J
> >> > (particularly
> >> > > > > ND4J)
> >> > > > > > > with Adam Gibson, the core developer of the library, but we
> >> never
> >> > > got
> >> > > > > > > around to implementing anything.
> >> > > > > > >
> >> > > > > > > Whether it makes sense to have an integration with DL4J as
> >> part
> >> > of
> >> > > > the
> >> > > > > > > Flink distribution would be up for discussion. I would
> >> suggest to
> >> > > > make
> >> > > > > it
> >> > > > > > > an independent repo to allow for
> >> > > > > > > faster dev/release cycles, and because it wouldn't be
> directly
> >> > > > related
> >> > > > > to
> >> > > > > > > the core of Flink so it would add extra reviewing burden to
> an
> >> > > > already
> >> > > > > > > overloaded group of committers.
> >> > > > > > >
> >> > > > > > > Natively supporting GPU calculations in Flink would be much
> >> > better
> >> > > > > > achieved
> >> > > > > > > through a library like ND4J, the engineering burden would be
> >> too
> >> > > much
> >> > > > > > > otherwise.
> >> > > > > > >
> >> > > > > > > Regards,
> >> > > > > > > Theodore
> >> > > > > > >
> >> > > > > > > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <
> >> > > > katherinmail@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hello, guys.
> >> > > > > > > >
> >> > > > > > > > Theodore, last week I started the review of the PR:
> >> > > > > > > > https://github.com/apache/flink/pull/2735 related to
> >> *word2Vec
> >> > > for
> >> > > > > > > Flink*.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > During this review I have asked myself: why do we need to
> >> > > implement
> >> > > > > > such
> >> > > > > > > a
> >> > > > > > > > very popular algorithm like *word2vec one more time*, when
> >> > there
> >> > > is
> >> > > > > > > already
> >> > > > > > > > available implementation in java provided by
> >> > deeplearning4j.org
> >> > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
> >> Apache
> >> > 2
> >> > > > > > > licence).
> >> > > > > > > > This library tries to promote itself; there is hype around
> >> > > > > > > > it in the ML sphere, and it was integrated with Apache
> >> > > > > > > > Spark to provide scalable deep learning calculations.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > *That's why I thought: could we also integrate this library
> >> > > > > > > > with Flink?*
> >> > > > > > > >
> >> > > > > > > > 1) Personally, I think providing support and deployment of
> >> > > > > > > > *Deep learning (DL) algorithms/models in Flink* is a
> >> > > > > > > > promising and attractive feature, because:
> >> > > > > > > >
> >> > > > > > > >     a) during the last two years DL has proved its
> >> > > > > > > > efficiency, and these algorithms are used in many
> >> > > > > > > > applications. For example *Spotify* uses DL-based algorithms
> >> > > > > > > > for music content extraction: Recommending music on
> Spotify
> >> > with
> >> > > > deep
> >> > > > > > > > learning AUGUST 05, 2014
> >> > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html>
> for
> >> > > their
> >> > > > > > music
> >> > > > > > > > recommendations. Developers need to scale DL up manually,
> >> > > > > > > > which causes a lot of work; that is why platforms like Flink
> >> > > > > > > > should support the deployment of these models.
> >> > > > > > > >
> >> > > > > > > >     b) Here is the scope of deep learning use cases
> >> > > > > > > > <https://deeplearning4j.org/use_cases>; many of these
> >> > > > > > > > scenarios could be supported on Flink.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > 2) But DL raises questions such as:
> >> > > > > > > >
> >> > > > > > > >     a) scaling calculations out over machines
> >> > > > > > > >
> >> > > > > > > >     b) performing these calculations over both CPU and GPU.
> >> > > > > > > > GPUs are required to train big DL models; otherwise the
> >> > > > > > > > learning process can converge very slowly.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > 3) I have checked the DL4J library, which already has rich
> >> > > > > > > > support for many attractive DL models, such as Recurrent
> >> > > > > > > > Networks and LSTMs, Convolutional Networks (CNN), Restricted
> >> > > > > > > > Boltzmann Machines (RBM) and others. So we won’t need to
> >> > > > > > > > implement them independently, but only provide the ability
> >> > > > > > > > to execute these models over a Flink cluster, in much the
> >> > > > > > > > same way as the Apache Spark integration.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Because of all of this I propose:
> >> > > > > > > >
> >> > > > > > > > 1)    To create a new ticket in Flink’s JIRA for the
> >> > > > > > > > integration of Flink with DL4J, and decide on which side
> >> > > > > > > > this integration should be implemented.
> >> > > > > > > >
> >> > > > > > > > 2)    To natively support GPU resources in Flink and allow
> >> > > > > > > > calculations over them, as described in this publication:
> >> > > > > > > > https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > *Regarding the original issue Implement Word2Vec
> >> > > > > > > > <https://issues.apache.org/jira/browse/FLINK-2094> in
> >> > > > > > > > Flink,* I have investigated its implementation in DL4J and
> >> > > > > > > > the DL4J/Apache Spark integration, and got several points:
> >> > > > > > > >
> >> > > > > > > > It seems that the idea of building our own implementation of
> >> > > > > > > > word2vec in Flink is not such a bad solution: DL4J was forced
> >> > > > > > > > to reimplement its original word2vec over Spark, and when I
> >> > > > > > > > checked that integration I found it too strongly coupled with
> >> > > > > > > > the Spark API, so it is impossible to just take some DL4J API
> >> > > > > > > > and reuse it; instead we would need to implement an
> >> > > > > > > > independent integration for Flink.
> >> > > > > > > >
> >> > > > > > > > *That’s why we should simply finish the implementation of
> >> > > > > > > > the current PR independently from the DL4J integration.*
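As a side note on why a native implementation is feasible: the core skip-gram pair generation in word2vec is quite small. A toy sketch (my own simplification; a full implementation also needs vocabulary building, subsampling and negative sampling):

```java
import java.util.ArrayList;
import java.util.List;

// Toy skip-gram pair generation: for each center word, emit pairs with
// every word inside a symmetric window. Illustrative only.
public class SkipGrams {
    static List<String[]> pairs(String[] tokens, int window) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(tokens.length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) out.add(new String[]{tokens[i], tokens[j]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] sentence = {"flink", "streams", "data"};
        for (String[] p : pairs(sentence, 1)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```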
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Could you please share your opinion on these questions and
> >> > > > > > > > points?
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > пн, 6 февр. 2017 г. в 12:51, Katherin Eri <
> >> > > katherinmail@gmail.com
> >> > > > >:
> >> > > > > > > >
> >> > > > > > > > > Sorry, guys I need to finish this letter first.
> >> > > > > > > > >   Full version of it will come shortly.
> >> > > > > > > > >
> >> > > > > > > > > пн, 6 февр. 2017 г. в 12:49, Katherin Eri <
> >> > > > katherinmail@gmail.com
> >> > > > > >:
> >> > > > > > > > >
> >> > > > > > > > > Hello, guys.
> >> > > > > > > > > Theodore, last week I started the review of the PR:
> >> > > > > > > > > https://github.com/apache/flink/pull/2735 related to
> >> > *word2Vec
> >> > > > for
> >> > > > > > > > Flink*.
> >> > > > > > > > >
> >> > > > > > > > > During this review I have asked myself: why do we need
> to
> >> > > > implement
> >> > > > > > > such
> >> > > > > > > > a
> >> > > > > > > > > very popular algorithm like *word2vec one more time*,
> when
> >> > > there
> >> > > > is
> >> > > > > > > > > already available implementation in java provided by
> >> > > > > > deeplearning4j.org
> >> > > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
> >> > Apache
> >> > > 2
> >> > > > > > > > licence).
> >> > > > > > > > > This library tries to promote itself, there is a hype
> >> around
> >> > > it
> >> > > > in
> >> > > > > > ML
> >> > > > > > > > > sphere, and  it was integrated with Apache Spark, to
> >> provide
> >> > > > > scalable
> >> > > > > > > > > deeplearning calculations.
> >> > > > > > > > > That's why I thought: could we integrate with this
> >> library or
> >> > > not
> >> > > > > > also
> >> > > > > > > > and
> >> > > > > > > > > Flink?
> >> > > > > > > > > 1) Personally I think, providing support and deployment
> of
> >> > > > > > Deeplearning
> >> > > > > > > > > algorithms/models in Flink is promising and attractive
> >> > feature,
> >> > > > > > > because:
> >> > > > > > > > >     a) during last two years deeplearning proved its
> >> > efficiency
> >> > > > and
> >> > > > > > > this
> >> > > > > > > > > algorithms used in many applications. For example
> *Spotify
> >> > > *uses
> >> > > > DL
> >> > > > > > > based
> >> > > > > > > > > algorithms for music content extraction: Recommending
> >> music
> >> > on
> >> > > > > > Spotify
> >> > > > > > > > > with deep learning AUGUST 05, 2014
> >> > > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html>
> >> for
> >> > > > their
> >> > > > > > > music
> >> > > > > > > > > recommendations. Doing this in a natively scalable way is
> >> > > > > > > > > very attractive.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > I have investigated that implementation of integration
> >> DL4J
> >> > > with
> >> > > > > > Apache
> >> > > > > > > > > Spark, and got several points:
> >> > > > > > > > >
> >> > > > > > > > > 1) It seems that the idea of building our own
> >> > > > > > > > > implementation of word2vec is not such a bad solution,
> >> > > > > > > > > because the integration of DL4J with Spark is too strongly
> >> > > > > > > > > coupled with the Spark API, and it will take time on the
> >> > > > > > > > > DL4J side to adapt this integration to Flink. Also I had
> >> > > > > > > > > expected that we would be able to just call some API, but
> >> > > > > > > > > it is not like that.
> >> > > > > > > > > 2)
> >> > > > > > > > >
> >> > > > > > > > > https://deeplearning4j.org/use_cases
> >> > > > > > > > > https://www.analyticsvidhya.com/blog/2017/01/t-sne-
> >> > > > > > > > implementation-r-python/
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > чт, 19 янв. 2017 г. в 13:29, Till Rohrmann <
> >> > > trohrmann@apache.org
> >> > > > >:
> >> > > > > > > > >
> >> > > > > > > > > Hi Katherin,
> >> > > > > > > > >
> >> > > > > > > > > welcome to the Flink community. Always great to see new
> >> > people
> >> > > > > > joining
> >> > > > > > > > the
> >> > > > > > > > > community :-)
> >> > > > > > > > >
> >> > > > > > > > > Cheers,
> >> > > > > > > > > Till
> >> > > > > > > > >
> >> > > > > > > > > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
> >> > > > > > > > katherinmail@gmail.com>
> >> > > > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > ok, I've got it.
> >> > > > > > > > > > I will take a look at
> >> > > > https://github.com/apache/flink/pull/2735
> >> > > > > .
> >> > > > > > > > > >
> >> > > > > > > > > > вт, 17 янв. 2017 г. в 14:36, Theodore Vasiloudis <
> >> > > > > > > > > > theodoros.vasiloudis@gmail.com>:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hello Katherin,
> >> > > > > > > > > > >
> >> > > > > > > > > > > Welcome to the Flink community!
> >> > > > > > > > > > >
> >> > > > > > > > > > > The ML component definitely needs a lot of work you
> >> are
> >> > > > > correct,
> >> > > > > > we
> >> > > > > > > > are
> >> > > > > > > > > > > facing similar problems to CEP, which we'll
> hopefully
> >> > > resolve
> >> > > > > > with
> >> > > > > > > > the
> >> > > > > > > > > > > restructuring Stephan has mentioned in that thread.
> >> > > > > > > > > > >
> >> > > > > > > > > > > If you'd like to help out with PRs we have many
> open,
> >> > one I
> >> > > > > have
> >> > > > > > > > > started
> >> > > > > > > > > > > reviewing but got side-tracked is the Word2Vec one
> >> [1].
> >> > > > > > > > > > >
> >> > > > > > > > > > > Best,
> >> > > > > > > > > > > Theodore
> >> > > > > > > > > > >
> >> > > > > > > > > > > [1] https://github.com/apache/flink/pull/2735
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <
> >> > > > > > fhueske@gmail.com
> >> > > > > > > >
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Hi Katherin,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > welcome to the Flink community!
> >> > > > > > > > > > > > Help with reviewing PRs is always very welcome
> and a
> >> > > great
> >> > > > > way
> >> > > > > > to
> >> > > > > > > > > > > > contribute.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Best, Fabian
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> >> > > > > > > > katherinmail@gmail.com
> >> > > > > > > > > >:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > Thank you, Timo.
> >> > > > > > > > > > > > > I have started the analysis of the topic.
> >> > > > > > > > > > > And if necessary, I will try to perform the
> >> review
> >> > > of
> >> > > > > > other
> >> > > > > > > > > pulls)
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > вт, 17 янв. 2017 г. в 13:09, Timo Walther <
> >> > > > > > twalthr@apache.org
> >> > > > > > > >:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Hi Katherin,
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > great to hear that you would like to
> contribute!
> >> > > > Welcome!
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > I gave you contributor permissions. You can
> now
> >> > > assign
> >> > > > > > issues
> >> > > > > > > > to
> >> > > > > > > > > > > > > > yourself. I assigned FLINK-1750 to you.
> >> > > > > > > > > > > > > > Right now there are many open ML pull
> requests,
> >> you
> >> > > are
> >> > > > > > very
> >> > > > > > > > > > welcome
> >> > > > > > > > > > > to
> >> > > > > > > > > > > > > > review the code of others, too.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Timo
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On 17/01/17 at 10:39, Katherin Sotenko wrote:
> >> > > > > > > > > > > > > > > Hello, All!
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > I'm Kate Eri, a Java developer with 6 years of
> >> > > > enterprise
> >> > > > > > > > > > experience,
> >> > > > > > > > > > > > > also
> >> > > > > > > > > > > > > > I
> >> > > > > > > > > > > > > > > have some expertise with scala (half of the
> >> > year).
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Last 2 years I have participated in several
> >> > BigData
> >> > > > > > > projects
> >> > > > > > > > > that
> >> > > > > > > > > > > > were
> >> > > > > > > > > > > > > > > related to Machine Learning (Time series
> >> > analysis,
> >> > > > > > > > Recommender
> >> > > > > > > > > > > > systems,
> >> > > > > > > > > > > > > > > Social networking) and ETL. I have
> experience
> >> > with
> >> > > > > > Hadoop,
> >> > > > > > > > > Apache
> >> > > > > > > > > > > > Spark
> >> > > > > > > > > > > > > > and
> >> > > > > > > > > > > > > > > Hive.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > I’m fond of ML topic, and I see that Flink
> >> > project
> >> > > > > > requires
> >> > > > > > > > > some
> >> > > > > > > > > > > work
> >> > > > > > > > > > > > > in
> >> > > > > > > > > > > > > > > this area, so that’s why I would like to
> join
> >> > Flink
> >> > > > and
> >> > > > > > ask
> >> > > > > > > > me
> >> > > > > > > > > to
> >> > > > > > > > > > > > grant
> >> > > > > > > > > > > > > > the
> >> > > > > > > > > > > > > > > assignment of the ticket
> >> > > > > > > > > > > > > > https://issues.apache.org/jira
> >> /browse/FLINK-1750
> >> > > > > > > > > > > > > > > to me.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: New Flink team member - Kate Eri.

Posted by Trevor Grant <tr...@gmail.com>.
Also RE: DL4J integration.

Suneel had done some work on this a while back, and ran into issues.  You
might want to chat with him about the pitfalls and 'gotchyas' there.



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Fri, Feb 10, 2017 at 7:37 AM, Trevor Grant <tr...@gmail.com>
wrote:

> Sorry for chiming in late.
>
> GPUs on Flink.  Till raised a good point- you need to be able to fall back
> to non-GPU resources if they aren't available.
>
> Fun fact: this has already been developed for Flink vis-a-vis the Apache
> Mahout project.
>
> In short- Mahout exposes a number of tensor functions (vector %*% matrix,
> matrix %*% matrix, etc).  If compiled for GPU support, those operations are
> completed via GPU- and if no GPUs are in fact available, Mahout math falls
> back to CPUs (and finally back to the JVM).
>
> How this should work is Flink takes care of shipping data around the
> cluster, and when data arrives at the local node- is dumped out to GPU for
> calculation, loaded back up and shipped back around cluster.  In practice,
> the lack of a persist method for intermediate results makes this
> troublesome (not because of GPUs but for calculating any sort of complex
> algorithm we expect to be able to cache intermediate results).
>
> +1 to FLINK-1730
>
> Everything in Mahout is modular- distributed engine
> (Flink/Spark/Write-your-own), Native Solvers (OpenMP / ViennaCL / CUDA /
> Write-your-own), algorithms, etc.
>
> So to sum up, you're noting the redundancy between ML packages in terms of
> algorithms- I would recommend checking out Mahout before rolling your own
> GPU integration (else risk redundantly integrating GPUs). If nothing else-
> it should give you some valuable insight regarding design considerations.
> Also FYI the goal of the Apache Mahout project is to address that problem
> precisely- implement an algorithm once in a mathematically expressive DSL,
> which is abstracted above the engine so the same code easily ports between
> engines / native solvers (i.e. CPU/GPU).
>
> https://github.com/apache/mahout/tree/master/viennacl-omp
> https://github.com/apache/mahout/tree/master/viennacl
>
> Best,
> tg
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Fri, Feb 10, 2017 at 7:01 AM, Katherin Eri <ka...@gmail.com>
> wrote:
>
>> Thank you, Felix, for the provided information.
>>
>> Currently I am analyzing the provided integration of Flink with SystemML.
>>
>> I am also gathering information for the ticket FLINK-1730
>> <https://issues.apache.org/jira/browse/FLINK-1730>; maybe we will take it
>> on, to unlock the SystemML/Flink integration.
>>
>>
>>
>> Thu, 9 Feb 2017 at 0:17, Felix Neutatz <neutatz@googlemail.com.invali
>> d>:
>>
>> > Hi Kate,
>> >
>> > 1) - Broadcast:
>> >
>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+
>> Only+send+data+to+each+taskmanager+once+for+broadcasts
>> >  - Caching: https://issues.apache.org/jira/browse/FLINK-1730
>> >
>> > 2) I have no idea about the GPU implementation. The SystemML mailing
>> list
>> > will probably help you out there.
>> >
>> > Best regards,
>> > Felix
>> >
>> > 2017-02-08 14:33 GMT+01:00 Katherin Eri <ka...@gmail.com>:
>> >
>> > > Thank you, Felix, for your point; it is quite interesting.
>> > >
>> > > I will take a look at the code of the provided Flink integration.
>> > >
>> > > 1)    You have these problems with Flink: >>we realized that the lack
>> of
>> > a
>> > > caching operator and a broadcast issue highly affects the performance,
>> > have
>> > > you already asked the community about this? If yes, please
>> provide
>> > the
>> > > reference to the ticket or the subject of the thread.
>> > >
>> > > 2)    You have said that SystemML provides GPU support. I have seen
>> > > SystemML’s source code and would like to ask: why did you decide to
>> > > implement your own integration with CUDA? Did you consider
>> ND4J,
>> > or
>> > > because it is younger, do you maintain your own implementation?
>> > >
>> > > Tue, 7 Feb 2017 at 18:35, Felix Neutatz <ne...@googlemail.com>:
>> > >
>> > > > Hi Katherin,
>> > > >
>> > > > we are also working in a similar direction. We implemented a
>> prototype
>> > to
>> > > > integrate with SystemML:
>> > > > https://github.com/apache/incubator-systemml/pull/119
>> > > > SystemML provides many different matrix formats, operations, GPU
>> > support
>> > > > and a couple of DL algorithms. Unfortunately, we realized that the
>> lack
>> > > of
>> > > > a caching operator and a broadcast issue highly affects the
>> performance
>> > > > (e.g. compared to Spark). At the moment I am trying to tackle the
>> > > broadcast
>> > > > issue. But caching is still a problem for us.
>> > > >
>> > > > Best regards,
>> > > > Felix
>> > > >
>> > > > 2017-02-07 16:22 GMT+01:00 Katherin Eri <ka...@gmail.com>:
>> > > >
>> > > > > Thank you, Till.
>> > > > >
>> > > > > 1)      Regarding ND4J, I didn’t know about such an unfortunate and
>> critical
>> > > > > restriction of it -> lack of sparsity optimizations, and you are
>> > right:
>> > > > > this issue is still relevant for them. I saw that Flink uses Breeze,
>> > but
>> > > I
>> > > > > thought its usage was due to historical reasons.
>> > > > >
>> > > > > 2)      Regarding integration with DL4J, I have read the source
>> code
>> > of
>> > > > > DL4J/Spark integration; that’s why I have abandoned my idea of
>> reusing
>> > > > > their word2vec implementation for now, for example. I can perform
>> > > deeper
>> > > > > investigation of this topic, if required.
>> > > > >
>> > > > >
>> > > > >
>> > > > > So I feel that we have the following picture:
>> > > > >
>> > > > > 1)      DL integration investigation could be part of Apache
>> Bahir.
>> > I
>> > > > can
>> > > > > perform further investigation of this topic, but I think we need
>> some
>> > > > > separate ticket to track this activity.
>> > > > >
>> > > > > 2)      GPU support, required for DL, is interesting, but requires
>> > ND4J
>> > > > for
>> > > > > example.
>> > > > >
>> > > > > 3)      ND4J couldn’t be incorporated because it doesn’t support
>> > > sparsity
>> > > > > <https://deeplearning4j.org/roadmap.html> [1].
>> > > > >
>> > > > > Regarding ND4J: is this the single blocker for its incorporation,
>> or
>> > > are
>> > > > > some others known?
>> > > > >
>> > > > >
>> > > > > [1] https://deeplearning4j.org/roadmap.html
>> > > > >
>> > > > > Tue, 7 Feb 2017 at 16:26, Till Rohrmann <trohrmann@apache.org
>> >:
>> > > > >
>> > > > > Thanks for initiating this discussion Katherin. I think you're
>> right
>> > > that
>> > > > > in general it does not make sense to reinvent the wheel over and
>> over
>> > > > > again. Especially if you only have limited resources at hand. So
>> if
>> > we
>> > > > > could integrate Flink with some existing library that would be
>> great.
>> > > > >
>> > > > > In the past, however, we couldn't find a good library which
>> provided
>> > > > enough
>> > > > > freedom to integrate it with Flink. Especially if you want to have
>> > > > > distributed and somewhat high-performance implementations of ML
>> > > > algorithms
>> > > > > you would have to take Flink's execution model (capabilities as
>> well
>> > as
>> > > > > limitations) into account. That is mainly the reason why we
>> started
>> > > > > implementing some of the algorithms "natively" on Flink.
>> > > > >
>> > > > > If I remember correctly, the problem with ND4J was and still
>> is
>> > > that
>> > > > > it does not support sparse matrices which was a requirement from
>> our
>> > > > side.
>> > > > > As far as I know, it is quite common that you have sparse data
>> > > structures
>> > > > > when dealing with large scale problems. That's why we built our
>> own
>> > > > > abstraction which can have different implementations. Currently,
>> the
>> > > > > default implementation uses Breeze.
>> > > > >
>> > > > > I think the support for GPU based operations and the actual
>> resource
>> > > > > management are two orthogonal things. The implementation would
>> have
>> > to
>> > > > work
>> > > > > with no GPUs available anyway. If the system detects that GPUs are
>> > > > > available, then ideally it would exploit them. Thus, we could add
>> > this
>> > > > > feature later and maybe integrate it with FLINK-5131 [1].
>> > > > >
>> > > > > Concerning the integration with DL4J I think that Theo's proposal
>> to
>> > do
>> > > > it
>> > > > > in a separate repository (maybe as part of Apache Bahir) is a good
>> > > idea.
>> > > > > We're currently thinking about outsourcing some of Flink's
>> libraries
>> > > into
>> > > > > sub projects. This could also be an option for the DL4J
>> integration
>> > > then.
>> > > > > In general I think it should be feasible to run DL4J on Flink
>> given
>> > > that
>> > > > it
>> > > > > also runs on Spark. Have you already looked at it more closely?
>> > > > >
>> > > > > [1] https://issues.apache.org/jira/browse/FLINK-5131
>> > > > >
>> > > > > Cheers,
>> > > > > Till
>> > > > >
>> > > > > On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri <
>> > katherinmail@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Thank you Theodore, for your reply.
>> > > > > >
>> > > > > > 1)    Regarding GPU, your point is clear and I agree with it,
>> ND4J
>> > > > looks
>> > > > > > appropriate. But my current understanding is that we also
>> need to
>> > > > cover
>> > > > > > some resource management questions -> when we need to provide
>> GPU
>> > > > support
>> > > > > > we also need to manage it as a resource. For example, Mesos has
>> > > already
>> > > > > > supported GPU as a resource type: Initial support for GPU
>> > resources.
>> > > > > > <
>> > https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU
>> > > >
>> > > > > > Flink
>> > > > > > uses Mesos as cluster manager, and this means that this feature
>> of
>> > > > Mesos
>> > > > > > could be reused. Also, memory management questions in Flink
>> regarding
>> > > GPU
>> > > > > > should be clarified.
>> > > > > >
>> > > > > > 2)    Regarding integration with DL4J: what stops us from
>> creating a
>> > > > ticket
>> > > and starting the discussion around this topic? Do we need a user
>> story,
>> > > or
>> > > > > is the
>> > > > > > community not sure that DL is really helpful? Why did the
>> discussion
>> > > > with
>> > > > > > Adam
>> > > > > > Gibson end with no implementation of any idea? What
>> > > concerns
>> > > > do
>> > > > > > we have?
>> > > > > >
>> > > > > > Mon, 6 Feb 2017 at 15:01, Theodore Vasiloudis <
>> > > > > > theodoros.vasiloudis@gmail.com>:
>> > > > > >
>> > > > > > > Hello all,
>> > > > > > >
>> > > > > > > This is point that has come up in the past: Given the
>> multitude
>> > of
>> > > ML
>> > > > > > > libraries out there, should we have native implementations in
>> > > FlinkML
>> > > > > or
>> > > > > > > try to integrate other libraries instead?
>> > > > > > >
>> > > > > > > We haven't managed to reach a consensus on this before. My
>> > opinion
>> > > is
>> > > > > > that
>> > > > > > > there is definitely value in having ML algorithms written
>> > natively
>> > > in
>> > > > > > > Flink, both for performance optimization,
>> > > > > > > but more importantly for engineering simplicity, we don't
>> want to
>> > > > force
>> > > > > > > users to use yet another piece of software to run their ML
>> algos
>> > > (at
>> > > > > > least
>> > > > > > > for a basic set of algorithms).
>> > > > > > >
>> > > > > > > We have in the past discussed integrations with DL4J
>> > (particularly
>> > > > > ND4J)
>> > > > > > > with Adam Gibson, the core developer of the library, but we
>> never
>> > > got
>> > > > > > > around to implementing anything.
>> > > > > > >
>> > > > > > > Whether it makes sense to have an integration with DL4J as
>> part
>> > of
>> > > > the
>> > > > > > > Flink distribution would be up for discussion. I would
>> suggest to
>> > > > make
>> > > > > it
>> > > > > > > an independent repo to allow for
>> > > > > > > faster dev/release cycles, and because it wouldn't be directly
>> > > > related
>> > > > > to
>> > > > > > > the core of Flink so it would add extra reviewing burden to an
>> > > > already
>> > > > > > > overloaded group of committers.
>> > > > > > >
>> > > > > > > Natively supporting GPU calculations in Flink would be much
>> > better
>> > > > > > achieved
>> > > > > > > through a library like ND4J, the engineering burden would be
>> too
>> > > much
>> > > > > > > otherwise.
>> > > > > > >
>> > > > > > > Regards,
>> > > > > > > Theodore
>> > > > > > >
>> > > > > > > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <
>> > > > katherinmail@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hello, guys.
>> > > > > > > >
>> > > > > > > > Theodore, last week I started the review of the PR:
>> > > > > > > > https://github.com/apache/flink/pull/2735 related to
>> *word2Vec
>> > > for
>> > > > > > > Flink*.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > During this review I have asked myself: why do we need to
>> > > implement
>> > > > > > such
>> > > > > > > a
>> > > > > > > > very popular algorithm like *word2vec one more time*, when
>> > there
>> > > is
>> > > > > > > already
>> > > > > > > > an available implementation in Java provided by
>> > deeplearning4j.org
>> > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
>> Apache
>> > 2
>> > > > > > > licence).
>> > > > > > > > This library tries to promote itself; there is hype
>> around it
>> > > in
>> > > > ML
>> > > > > > > > sphere, and it was integrated with Apache Spark, to provide
>> > > > scalable
>> > > > > > > > deeplearning calculations.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > *That's why I thought: could we also integrate this library
>> > > > > > > > with Flink? *
>> > > > > > > >
>> > > > > > > > 1) Personally, I think providing support and deployment of
>> > > > > > > > *Deeplearning(DL)
>> > > > > > > > algorithms/models in Flink* is a promising and attractive
>> > feature,
>> > > > > > because:
>> > > > > > > >
>> > > > > > > >     a) during last two years DL proved its efficiency and
>> these
>> > > > > > > algorithms
>> > > > > > > > are used in many applications. For example, *Spotify *uses DL
>> based
>> > > > > > algorithms
>> > > > > > > > for music content extraction: Recommending music on Spotify
>> > with
>> > > > deep
>> > > > > > > > learning AUGUST 05, 2014
>> > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for
>> > > their
>> > > > > > music
>> > > > > > > > recommendations. Developers need to scale up DL manually,
>> which
>> > > > causes
>> > > > > a
>> > > > > > > lot
>> > > > > > > > of work; that’s why platforms like Flink should
>> support
>> > > > the
>> > > > > > > > deployment of these models.
>> > > > > > > >
>> > > > > > > >     b) Here is the scope of deep learning use
>> cases
>> > > > > > > > <https://deeplearning4j.org/use_cases>; many of these
>> > > scenarios
>> > > > > > > > could be supported on Flink.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > 2) But DL raises questions such as:
>> > > > > > > >
>> > > > > > > >     a) scale up calculations over machines
>> > > > > > > >
>> > > > > > > >     b) perform these calculations both over CPU and GPU.
>> GPU is
>> > > > > > required
>> > > > > > > to
>> > > > > > > > train big DL models, otherwise learning process could have
>> very
>> > > > slow
>> > > > > > > > convergence.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > 3) I have checked this DL4J library, which already has
>> rich
>> > > > support
>> > > > > > of
>> > > > > > > > many attractive DL models like: Recurrent Networks and
>> LSTMs,
>> > > > > > > Convolutional
>> > > > > > > > Networks (CNN), Restricted Boltzmann Machines (RBM) and
>> others.
>> > > So
>> > > > we
>> > > > > > > won’t
>> > > > > > > > need to implement them independently, but only provide the
>> > > ability
>> > > > of
>> > > > > > > > execution of these models over a Flink cluster, in a quite
>> similar
>> > > way
>> > > > > like
>> > > > > > > it
>> > > > > > > > was integrated with Apache Spark.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Because of all of this I propose:
>> > > > > > > >
>> > > > > > > > 1)    To create a new ticket in Flink’s JIRA for the integration
>> of
>> > > Flink
>> > > > > > with
>> > > > > > > > DL4J and decide on which side this integration should be
>> > > > implemented.
>> > > > > > > >
>> > > > > > > > 2)    Natively support GPU resources in Flink and allow
>> > > > calculations
>> > > > > > over
>> > > > > > > > them, like that is described in this publication
>> > > > > > > > https://www.oreilly.com/learning/accelerating-spark-
>> > > > > > workloads-using-gpus
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > *Regarding original issue Implement Word2Vec
>> > > > > > > > <https://issues.apache.org/jira/browse/FLINK-2094>in Flink,
>> > *I
>> > > > have
>> > > > > > > > investigated its implementation in DL4J and  that
>> > implementation
>> > > of
>> > > > > > > > integration DL4J with Apache Spark, and got several points:
>> > > > > > > >
>> > > > > > > > It seems that the idea of building our own implementation of
>> > > > word2vec
>> > > > > in
>> > > > > > > > Flink is not such a bad solution, because DL4J itself was
>> forced to
>> > > > > > > reimplement
>> > > > > > > > its original word2Vec over Spark. I have checked the
>> > integration
>> > > of
>> > > > > > DL4J
>> > > > > > > > with Spark, and found that it is too strongly coupled with
>> > Spark
>> > > > API,
>> > > > > > so
>> > > > > > > > that it is impossible just to take some DL4J API and reuse
>> it;
>> > > > > instead
>> > > > > > we
>> > > > > > > > need to implement an independent integration for Flink.
>> > > > > > > >
>> > > > > > > > *That’s why we should simply finish the implementation of the
>> > > > > > > > current PR **independently **of
>> > > > > > > > DL4J integration.*
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Could you please provide your opinion regarding my questions
>> > and
>> > > > > > points,
>> > > > > > > > what do you think about them?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Mon, 6 Feb 2017 at 12:51, Katherin Eri <
>> > > katherinmail@gmail.com
>> > > > >:
>> > > > > > > >
>> > > > > > > > > Sorry, guys, I need to finish this letter first.
>> > > > > > > > >   Full version of it will come shortly.
>> > > > > > > > >
>> > > > > > > > > Mon, 6 Feb 2017 at 12:49, Katherin Eri <
>> > > > katherinmail@gmail.com
>> > > > > >:
>> > > > > > > > >
>> > > > > > > > > Hello, guys.
>> > > > > > > > > Theodore, last week I started the review of the PR:
>> > > > > > > > > https://github.com/apache/flink/pull/2735 related to
>> > *word2Vec
>> > > > for
>> > > > > > > > Flink*.
>> > > > > > > > >
>> > > > > > > > > During this review I have asked myself: why do we need to
>> > > > implement
>> > > > > > > such
>> > > > > > > > a
>> > > > > > > > > very popular algorithm like *word2vec one more time*, when
>> > > there
>> > > > is
>> > > > > > > > > an available implementation in Java provided by
>> > > > > > deeplearning4j.org
>> > > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
>> > Apache
>> > > 2
>> > > > > > > > licence).
>> > > > > > > > > This library tries to promote itself; there is hype
>> around
>> > > it
>> > > > in
>> > > > > > ML
>> > > > > > > > > sphere, and it was integrated with Apache Spark, to
>> provide
>> > > > > scalable
>> > > > > > > > > deeplearning calculations.
>> > > > > > > > > That's why I thought: could we also integrate this library
>> > > > > > > > > with Flink?
>> > > > > > > > > 1) Personally I think, providing support and deployment of
>> > > > > > Deeplearning
>> > > > > > > > > algorithms/models in Flink is promising and attractive
>> > feature,
>> > > > > > > because:
>> > > > > > > > >     a) during last two years deeplearning proved its
>> > efficiency
>> > > > and
>> > > > > > > these
>> > > > > > > > > algorithms are used in many applications. For example, *Spotify
>> > > *uses
>> > > > DL
>> > > > > > > based
>> > > > > > > > > algorithms for music content extraction: Recommending
>> music
>> > on
>> > > > > > Spotify
>> > > > > > > > > with deep learning AUGUST 05, 2014
>> > > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html>
>> for
>> > > > their
>> > > > > > > music
>> > > > > > > > > recommendations. Doing this in a natively scalable way is very
>> > > attractive.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I have investigated the implementation of the integration of
>> DL4J
>> > > with
>> > > > > > Apache
>> > > > > > > > > Spark, and got several points:
>> > > > > > > > >
>> > > > > > > > > 1) It seems that the idea of building our own
>> implementation
>> > of
>> > > > > > word2vec
>> > > > > > > > > is not such a bad solution, because the integration of DL4J
>> with
>> > > > Spark
>> > > > > > is
>> > > > > > > > too
>> > > > > > > > > strongly coupled with Spark API and it will take time from
>> > the
>> > > > side
>> > > > > > of
>> > > > > > > > DL4J
>> > > > > > > > > to adapt this integration to Flink. Also, I had expected
>> that
>> > > we
>> > > > > will
>> > > > > > > be
>> > > > > > > > > able to simply call some API, but that is not the case.
>> > > > > > > > > 2)
>> > > > > > > > >
>> > > > > > > > > https://deeplearning4j.org/use_cases
>> > > > > > > > > https://www.analyticsvidhya.com/blog/2017/01/t-sne-
>> > > > > > > > implementation-r-python/
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Thu, 19 Jan 2017 at 13:29, Till Rohrmann <
>> > > trohrmann@apache.org
>> > > > >:
>> > > > > > > > >
>> > > > > > > > > Hi Katherin,
>> > > > > > > > >
>> > > > > > > > > welcome to the Flink community. Always great to see new
>> > people
>> > > > > > joining
>> > > > > > > > the
>> > > > > > > > > community :-)
>> > > > > > > > >
>> > > > > > > > > Cheers,
>> > > > > > > > > Till
>> > > > > > > > >
>> > > > > > > > > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
>> > > > > > > > katherinmail@gmail.com>
>> > > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > ok, I've got it.
>> > > > > > > > > > I will take a look at
>> > > > https://github.com/apache/flink/pull/2735
>> > > > > .
>> > > > > > > > > >
>> > > > > > > > > > Tue, 17 Jan 2017 at 14:36, Theodore Vasiloudis <
>> > > > > > > > > > theodoros.vasiloudis@gmail.com>:
>> > > > > > > > > >
>> > > > > > > > > > > Hello Katherin,
>> > > > > > > > > > >
>> > > > > > > > > > > Welcome to the Flink community!
>> > > > > > > > > > >
>> > > > > > > > > > > The ML component definitely needs a lot of work you
>> are
>> > > > > correct,
>> > > > > > we
>> > > > > > > > are
>> > > > > > > > > > > facing similar problems to CEP, which we'll hopefully
>> > > resolve
>> > > > > > with
>> > > > > > > > the
>> > > > > > > > > > > restructuring Stephan has mentioned in that thread.
>> > > > > > > > > > >
>> > > > > > > > > > > If you'd like to help out with PRs we have many open,
>> > one I
>> > > > > have
>> > > > > > > > > started
>> > > > > > > > > > > reviewing but got side-tracked is the Word2Vec one
>> [1].
>> > > > > > > > > > >
>> > > > > > > > > > > Best,
>> > > > > > > > > > > Theodore
>> > > > > > > > > > >
>> > > > > > > > > > > [1] https://github.com/apache/flink/pull/2735
>> > > > > > > > > > >
>> > > > > > > > > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <
>> > > > > > fhueske@gmail.com
>> > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi Katherin,
>> > > > > > > > > > > >
>> > > > > > > > > > > > welcome to the Flink community!
>> > > > > > > > > > > > Help with reviewing PRs is always very welcome and a
>> > > great
>> > > > > way
>> > > > > > to
>> > > > > > > > > > > > contribute.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Best, Fabian
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
>> > > > > > > > katherinmail@gmail.com
>> > > > > > > > > >:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Thank you, Timo.
>> > > > > > > > > > > > > I have started the analysis of the topic.
>> > > > > > > > > > > > > And if it is necessary, I will try to perform the
>> review
>> > > of
>> > > > > > other
>> > > > > > > > > pulls)
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Tue, 17 Jan 2017 at 13:09, Timo Walther <
>> > > > > > twalthr@apache.org
>> > > > > > > >:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > Hi Katherin,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > great to hear that you would like to contribute!
>> > > > Welcome!
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > I gave you contributor permissions. You can now
>> > > assign
>> > > > > > issues
>> > > > > > > > to
>> > > > > > > > > > > > > > yourself. I assigned FLINK-1750 to you.
>> > > > > > > > > > > > > > Right now there are many open ML pull requests,
>> you
>> > > are
>> > > > > > very
>> > > > > > > > > > welcome
>> > > > > > > > > > > to
>> > > > > > > > > > > > > > review the code of others, too.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Timo
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On 17/01/17 at 10:39, Katherin Sotenko wrote:
>> > > > > > > > > > > > > > > Hello, All!
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > I'm Kate Eri, a Java developer with 6 years of
>> > > > enterprise
>> > > > > > > > > > experience,
>> > > > > > > > > > > > > also
>> > > > > > > > > > > > > > I
>> > > > > > > > > > > > > > > have some expertise with scala (half of the
>> > year).
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Last 2 years I have participated in several
>> > BigData
>> > > > > > > projects
>> > > > > > > > > that
>> > > > > > > > > > > > were
>> > > > > > > > > > > > > > > related to Machine Learning (Time series
>> > analysis,
>> > > > > > > > Recommender
>> > > > > > > > > > > > systems,
>> > > > > > > > > > > > > > > Social networking) and ETL. I have experience
>> > with
>> > > > > > Hadoop,
>> > > > > > > > > Apache
>> > > > > > > > > > > > Spark
>> > > > > > > > > > > > > > and
>> > > > > > > > > > > > > > > Hive.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > I’m fond of ML topic, and I see that Flink
>> > project
>> > > > > > requires
>> > > > > > > > > some
>> > > > > > > > > > > work
>> > > > > > > > > > > > > in
>> > > > > > > > > > > > > > > this area, so that’s why I would like to join
>> > Flink
>> > > > and
>> > > > > > ask
>> > > > > > > > me
>> > > > > > > > > to
>> > > > > > > > > > > > grant
>> > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > assignment of the ticket
>> > > > > > > > > > > > > > https://issues.apache.org/jira
>> /browse/FLINK-1750
>> > > > > > > > > > > > > > > to me.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: New Flink team member - Kate Eri.

Posted by Trevor Grant <tr...@gmail.com>.
Sorry for chiming in late.

GPUs on Flink.  Till raised a good point- you need to be able to fall back
to non-GPU resources if they aren't available.

Fun fact: this has already been developed for Flink vis-a-vis the Apache
Mahout project.

In short- Mahout exposes a number of tensor functions (vector %*% matrix,
matrix %*% matrix, etc).  If compiled for GPU support, those operations are
completed via GPU- and if no GPUs are in fact available, Mahout math falls
back to CPUs (and finally back to the JVM).

How this should work is Flink takes care of shipping data around the
cluster, and when data arrives at the local node- is dumped out to GPU for
calculation, loaded back up and shipped back around cluster.  In practice,
the lack of a persist method for intermediate results makes this
troublesome (not because of GPUs but for calculating any sort of complex
algorithm we expect to be able to cache intermediate results).
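The GPU-then-CPU-then-JVM fallback chain described above can be sketched in plain Java. This is a toy illustration of the selection pattern only, not actual Mahout code; the empty `Optional` probes stand in for real CUDA/OpenMP capability detection:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

/** Sketch of tiered solver selection: try GPU, then native CPU, then pure JVM. */
public class SolverFallback {

    interface Solver {
        String name();
        double[] times(double[][] m, double[] v); // matrix %*% vector
    }

    /** Pure-JVM matrix-vector product; always available as the last resort. */
    static Solver jvmSolver() {
        return new Solver() {
            public String name() { return "jvm"; }
            public double[] times(double[][] m, double[] v) {
                double[] out = new double[m.length];
                for (int i = 0; i < m.length; i++)
                    for (int j = 0; j < v.length; j++)
                        out[i] += m[i][j] * v[j];
                return out;
            }
        };
    }

    /** Pick the first tier whose probe yields a solver. */
    static Solver select(List<Supplier<Optional<Solver>>> tiers) {
        for (Supplier<Optional<Solver>> tier : tiers) {
            Optional<Solver> s = tier.get();
            if (s.isPresent()) return s.get();
        }
        throw new IllegalStateException("no solver available");
    }

    public static void main(String[] args) {
        // The GPU and native-CPU probes "fail" here to demonstrate the JVM fallback.
        Solver solver = select(List.of(
                Optional::empty,                   // GPU probe: no CUDA device found
                Optional::empty,                   // native CPU probe: no OpenMP build
                () -> Optional.of(jvmSolver())));  // JVM: always present
        double[] y = solver.times(new double[][]{{1, 2}, {3, 4}}, new double[]{1, 1});
        System.out.println(solver.name() + " " + y[0] + " " + y[1]);
    }
}
```

Because both hardware probes report nothing available, the selection falls through to the JVM solver, which is always present.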

+1 to FLINK-1730

Everything in Mahout is modular- distributed engine
(Flink/Spark/Write-your-own), Native Solvers (OpenMP / ViennaCL / CUDA /
Write-your-own), algorithms, etc.

So to sum up, you're noting the redundancy between ML packages in terms of
algorithms- I would recommend checking out Mahout before rolling your own
GPU integration (else risk redundantly integrating GPUs). If nothing else-
it should give you some valuable insight regarding design considerations.
Also FYI the goal of the Apache Mahout project is to address that problem
precisely- implement an algorithm once in a mathematically expressive DSL,
which is abstracted above the engine so the same code easily ports between
engines / native solvers (i.e. CPU/GPU).

https://github.com/apache/mahout/tree/master/viennacl-omp
https://github.com/apache/mahout/tree/master/viennacl

Best,
tg


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Fri, Feb 10, 2017 at 7:01 AM, Katherin Eri <ka...@gmail.com>
wrote:

> Thank you Felix, for provided information.
>
> Currently I am analyzing the provided integration of Flink with SystemML.
>
> I am also gathering information for the ticket FLINK-1730
> <https://issues.apache.org/jira/browse/FLINK-1730>; maybe we will take it
> on, to unblock the SystemML/Flink integration.
>
>
>
> чт, 9 февр. 2017 г. в 0:17, Felix Neutatz <neutatz@googlemail.com.
> invalid>:
>
> > Hi Kate,
> >
> > 1) - Broadcast:
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-
> 5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
> >  - Caching: https://issues.apache.org/jira/browse/FLINK-1730
> >
> > 2) I have no idea about the GPU implementation. The SystemML mailing list
> > will probably help you out there.
> >
> > Best regards,
> > Felix
> >
> > 2017-02-08 14:33 GMT+01:00 Katherin Eri <ka...@gmail.com>:
> >
> > > Thank you Felix, for your point, it is quite interesting.
> > >
> > > I will take a look at the code, of the provided Flink integration.
> > >
> > > 1)    You have these problems with Flink: >>we realized that the lack
> of
> > a
> > > caching operator and a broadcast issue highly affects the performance,
> > have
> > > you already asked about this the community? In case yes: please provide
> > the
> > > reference to the ticket or the topic of letter.
> > >
> > > 2)    You have said, that SystemML provides GPU support. I have seen
> > > SystemML’s source code and would like to ask: why you have decided to
> > > implement your own integration with cuda? Did you try to consider ND4J,
> > or
> > > because it is younger, you support your own implementation?
> > >
> > > вт, 7 февр. 2017 г. в 18:35, Felix Neutatz <ne...@googlemail.com>:
> > >
> > > > Hi Katherin,
> > > >
> > > > we are also working in a similar direction. We implemented a
> prototype
> > to
> > > > integrate with SystemML:
> > > > https://github.com/apache/incubator-systemml/pull/119
> > > > SystemML provides many different matrix formats, operations, GPU
> > support
> > > > and a couple of DL algorithms. Unfortunately, we realized that the
> lack
> > > of
> > > > a caching operator and a broadcast issue highly affects the
> performance
> > > > (e.g. compared to Spark). At the moment I am trying to tackle the
> > > broadcast
> > > > issue. But caching is still a problem for us.
> > > >
> > > > Best regards,
> > > > Felix
> > > >
> > > > 2017-02-07 16:22 GMT+01:00 Katherin Eri <ka...@gmail.com>:
> > > >
> > > > > Thank you, Till.
> > > > >
> > > > > 1)      Regarding ND4J, I didn’t know about such a pity and
> critical
> > > > > restriction of it -> lack of sparsity optimizations, and you are
> > right:
> > > > > this issue is still actual for them. I saw that Flink uses Breeze,
> > but
> > > I
> > > > > thought its usage caused by some historical reasons.
> > > > >
> > > > > 2)      Regarding integration with DL4J, I have read the source
> code
> > of
> > > > > DL4J/Spark integration, that’s why I have declined my idea of reuse
> > of
> > > > > their word2vec implementation for now, for example. I can perform
> > > deeper
> > > > > investigation of this topic, if it required.
> > > > >
> > > > >
> > > > >
> > > > > So I feel that we have the following picture:
> > > > >
> > > > > 1)      DL integration investigation, could be part of Apache
> Bahir.
> > I
> > > > can
> > > > > > perform further investigation of this topic, but I think we need a
> > > > > > separate ticket to track this activity.
> > > > >
> > > > > 2)      GPU support, required for DL is interesting, but requires
> > ND4J
> > > > for
> > > > > example.
> > > > >
> > > > > 3)      ND4J couldn’t be incorporated because it doesn’t support
> > > sparsity
> > > > > <https://deeplearning4j.org/roadmap.html> [1].
> > > > >
> > > > > Regarding ND4J is this the single blocker for incorporation of it
> or
> > > may
> > > > be
> > > > > some others known?
> > > > >
> > > > >
> > > > > [1] https://deeplearning4j.org/roadmap.html
> > > > >
> > > > > вт, 7 февр. 2017 г. в 16:26, Till Rohrmann <tr...@apache.org>:
> > > > >
> > > > > Thanks for initiating this discussion Katherin. I think you're
> right
> > > that
> > > > > in general it does not make sense to reinvent the wheel over and
> over
> > > > > again. Especially if you only have limited resources at hand. So if
> > we
> > > > > could integrate Flink with some existing library that would be
> great.
> > > > >
> > > > > In the past, however, we couldn't find a good library which
> provided
> > > > enough
> > > > > freedom to integrate it with Flink. Especially if you want to have
> > > > > distributed and somewhat high-performance implementations of ML
> > > > algorithms
> > > > > you would have to take Flink's execution model (capabilities as
> well
> > as
> > > > > limitations) into account. That is mainly the reason why we started
> > > > > implementing some of the algorithms "natively" on Flink.
> > > > >
> > > > > If I remember correctly, then the problem with ND4J was and still
> is
> > > that
> > > > > it does not support sparse matrices which was a requirement from
> our
> > > > side.
> > > > > As far as I know, it is quite common that you have sparse data
> > > structures
> > > > > when dealing with large scale problems. That's why we built our own
> > > > > abstraction which can have different implementations. Currently,
> the
> > > > > default implementation uses Breeze.
> > > > >
> > > > > I think the support for GPU based operations and the actual
> resource
> > > > > management are two orthogonal things. The implementation would have
> > to
> > > > work
> > > > > with no GPUs available anyway. If the system detects that GPUs are
> > > > > available, then ideally it would exploit them. Thus, we could add
> > this
> > > > > feature later and maybe integrate it with FLINK-5131 [1].
> > > > >
> > > > > Concerning the integration with DL4J I think that Theo's proposal
> to
> > do
> > > > it
> > > > > in a separate repository (maybe as part of Apache Bahir) is a good
> > > idea.
> > > > > We're currently thinking about outsourcing some of Flink's
> libraries
> > > into
> > > > > sub projects. This could also be an option for the DL4J integration
> > > then.
> > > > > In general I think it should be feasible to run DL4J on Flink given
> > > that
> > > > it
> > > > > also runs on Spark. Have you already looked at it closer?
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-5131
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri <
> > katherinmail@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thank you Theodore, for your reply.
> > > > > >
> > > > > > 1)    Regarding GPU, your point is clear and I agree with it,
> ND4J
> > > > looks
> > > > > > appropriate. But, my current understanding is that, we also need
> to
> > > > cover
> > > > > > some resource management questions -> when we need to provide GPU
> > > > support
> > > > > > we also need to manage it like resource. For example, Mesos has
> > > already
> > > > > > supported GPU like resource item: Initial support for GPU
> > resources.
> > > > > > <
> > https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU
> > > >
> > > > > > Flink
> > > > > > uses Mesos as cluster manager, and this means that this feature
> of
> > > > Mesos
> > > > > > could be reused. Also memory managing questions in Flink
> regarding
> > > GPU
> > > > > > should be clarified.
> > > > > >
> > > > > > 2)    Regarding integration with DL4J: what stops us to
> initialize
> > > > ticket
> > > > > > and start the discussion around this topic? We need some user
> story
> > > or
> > > > > the
> > > > > > community is not sure that DL is really helpful? Why the
> discussion
> > > > with
> > > > > > Adam
> > > > > > Gibson just finished with no implementation of any idea? What
> > > concerns
> > > > do
> > > > > > we have?
> > > > > >
> > > > > > пн, 6 февр. 2017 г. в 15:01, Theodore Vasiloudis <
> > > > > > theodoros.vasiloudis@gmail.com>:
> > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > > This is a point that has come up in the past: Given the multitude
> > of
> > > ML
> > > > > > > libraries out there, should we have native implementations in
> > > FlinkML
> > > > > or
> > > > > > > try to integrate other libraries instead?
> > > > > > >
> > > > > > > We haven't managed to reach a consensus on this before. My
> > opinion
> > > is
> > > > > > that
> > > > > > > there is definitely value in having ML algorithms written
> > natively
> > > in
> > > > > > > Flink, both for performance optimization,
> > > > > > > but more importantly for engineering simplicity, we don't want
> to
> > > > force
> > > > > > > users to use yet another piece of software to run their ML
> algos
> > > (at
> > > > > > least
> > > > > > > for a basic set of algorithms).
> > > > > > >
> > > > > > > We have in the past  discussed integrations with DL4J
> > (particularly
> > > > > ND4J)
> > > > > > > with Adam Gibson, the core developer of the library, but we
> never
> > > got
> > > > > > > around to implementing anything.
> > > > > > >
> > > > > > > Whether it makes sense to have an integration with DL4J as part
> > of
> > > > the
> > > > > > > Flink distribution would be up for discussion. I would suggest
> to
> > > > make
> > > > > it
> > > > > > > an independent repo to allow for
> > > > > > > faster dev/release cycles, and because it wouldn't be directly
> > > > related
> > > > > to
> > > > > > > the core of Flink so it would add extra reviewing burden to an
> > > > already
> > > > > > > overloaded group of committers.
> > > > > > >
> > > > > > > Natively supporting GPU calculations in Flink would be much
> > better
> > > > > > achieved
> > > > > > > through a library like ND4J, the engineering burden would be
> too
> > > much
> > > > > > > otherwise.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Theodore
> > > > > > >
> > > > > > > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <
> > > > katherinmail@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hello, guys.
> > > > > > > >
> > > > > > > > Theodore, last week I started the review of the PR:
> > > > > > > > https://github.com/apache/flink/pull/2735 related to
> *word2Vec
> > > for
> > > > > > > Flink*.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > During this review I have asked myself: why do we need to
> > > implement
> > > > > > such
> > > > > > > a
> > > > > > > > very popular algorithm like *word2vec one more time*, when
> > there
> > > is
> > > > > > > already
> > > > > > > > available implementation in java provided by
> > deeplearning4j.org
> > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
> Apache
> > 2
> > > > > > > licence).
> > > > > > > > This library tries to promote itself, there is a hype around
> it
> > > in
> > > > ML
> > > > > > > > sphere, and it was integrated with Apache Spark, to provide
> > > > scalable
> > > > > > > > deeplearning calculations.
> > > > > > > >
> > > > > > > >
> > > > > > > > *That's why I thought: could we integrate with this library
> or
> > > not
> > > > > also
> > > > > > > and
> > > > > > > > Flink? *
> > > > > > > >
> > > > > > > > 1) Personally I think, providing support and deployment of
> > > > > > > > *Deeplearning(DL)
> > > > > > > > algorithms/models in Flink* is promising and attractive
> > feature,
> > > > > > because:
> > > > > > > >
> > > > > > > >     a) during last two years DL proved its efficiency and
> these
> > > > > > > algorithms
> > > > > > > > used in many applications. For example *Spotify *uses DL
> based
> > > > > > algorithms
> > > > > > > > for music content extraction: Recommending music on Spotify
> > with
> > > > deep
> > > > > > > > learning AUGUST 05, 2014
> > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for
> > > their
> > > > > > music
> > > > > > > > recommendations. Developers need to scale up DL manually,
> that
> > > > causes
> > > > > a
> > > > > > > lot
> > > > > > > > of work, so that’s why such platforms like Flink should
> support
> > > > these
> > > > > > > > models deployment.
> > > > > > > >
> > > > > > > >     b) Here is presented the scope of Deeplearning usage
> cases
> > > > > > > > <https://deeplearning4j.org/use_cases>, so many of this
> > > scenarios
> > > > > > > related
> > > > > > > > to scenarios, that could be supported on Flink.
> > > > > > > >
> > > > > > > >
> > > > > > > > 2) But DL uncover such questions like:
> > > > > > > >
> > > > > > > >     a) scale up calculations over machines
> > > > > > > >
> > > > > > > >     b) perform these calculations both over CPU and GPU. GPU
> is
> > > > > > required
> > > > > > > to
> > > > > > > > train big DL models, otherwise learning process could have
> very
> > > > slow
> > > > > > > > convergence.
> > > > > > > >
> > > > > > > >
> > > > > > > > 3) I have checked this DL4J library, which already has rich
> > > > support
> > > > > > of
> > > > > > > > many attractive DL models like: Recurrent Networks and LSTMs,
> > > > > > > Convolutional
> > > > > > > > Networks (CNN), Restricted Boltzmann Machines (RBM) and
> others.
> > > So
> > > > we
> > > > > > > won’t
> > > > > > > > need to implement them independently, but only provide the
> > > ability
> > > > of
> > > > > > > > execution of this models over Flink cluster, the quite
> similar
> > > way
> > > > > like
> > > > > > > it
> > > > > > > > was integrated with Apache Spark.
> > > > > > > >
> > > > > > > >
> > > > > > > > Because of all of this I propose:
> > > > > > > >
> > > > > > > > 1)    To create new ticket in Flink’s JIRA for integration of
> > > Flink
> > > > > > with
> > > > > > > > DL4J and decide on which side this integration should be
> > > > implemented.
> > > > > > > >
> > > > > > > > 2)    Support natively GPU resources in Flink and allow
> > > > calculations
> > > > > > over
> > > > > > > > them, like that is described in this publication
> > > > > > > > https://www.oreilly.com/learning/accelerating-spark-
> > > > > > workloads-using-gpus
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > *Regarding original issue Implement Word2Vec
> > > > > > > > <https://issues.apache.org/jira/browse/FLINK-2094>in Flink,
> > *I
> > > > have
> > > > > > > > investigated its implementation in DL4J and  that
> > implementation
> > > of
> > > > > > > > integration DL4J with Apache Spark, and got several points:
> > > > > > > >
> > > > > > > > It seems that idea of building of our own implementation of
> > > > word2vec
> > > > > in
> > > > > > > > Flink not such a bad solution, because: This DL4J was forced
> to
> > > > > > > reimplement
> > > > > > > > its original word2Vec over Spark. I have checked the
> > integration
> > > of
> > > > > > DL4J
> > > > > > > > with Spark, and found that it is too strongly coupled with
> > Spark
> > > > API,
> > > > > > so
> > > > > > > > that it is impossible just to take some DL4J API and reuse
> it,
> > > > > instead
> > > > > > we
> > > > > > > > need to implement independent integration for Flink.
> > > > > > > >
> > > > > > > > *That’s why we simply finish implementation of current PR
> > > > > > > > **independently **from
> > > > > > > > integration to DL4J.*
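For reference, the core of the skip-gram step that the PR implements natively, turning a token sequence into (center, context) training pairs, can be sketched in generic Java. This is an illustration only, not the PR's or DL4J's actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipGramSketch {
    // Emit one (center, context) pair for every token within `window`
    // positions of each center token. Real word2vec then feeds these
    // pairs into negative-sampling or hierarchical-softmax training.
    static List<String[]> pairs(List<String> tokens, int window) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(tokens.size() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) out.add(new String[]{tokens.get(i), tokens.get(j)});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] p : pairs(List.of("the", "quick", "fox"), 1)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

Distributing this step is the easy part; the coupling problems mentioned above concern how the model parameters are shared and updated across workers.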
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Could you please provide your opinion regarding my questions
> > and
> > > > > > points,
> > > > > > > > what do you think about them?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > пн, 6 февр. 2017 г. в 12:51, Katherin Eri <
> > > katherinmail@gmail.com
> > > > >:
> > > > > > > >
> > > > > > > > > Sorry, guys I need to finish this letter first.
> > > > > > > > >   Full version of it will come shortly.
> > > > > > > > >
> > > > > > > > > пн, 6 февр. 2017 г. в 12:49, Katherin Eri <
> > > > katherinmail@gmail.com
> > > > > >:
> > > > > > > > >
> > > > > > > > > Hello, guys.
> > > > > > > > > Theodore, last week I started the review of the PR:
> > > > > > > > > https://github.com/apache/flink/pull/2735 related to
> > *word2Vec
> > > > for
> > > > > > > > Flink*.
> > > > > > > > >
> > > > > > > > > During this review I have asked myself: why do we need to
> > > > implement
> > > > > > > such
> > > > > > > > a
> > > > > > > > > very popular algorithm like *word2vec one more time*, when
> > > there
> > > > is
> > > > > > > > > already available implementation in java provided by
> > > > > > deeplearning4j.org
> > > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
> > Apache
> > > 2
> > > > > > > > licence).
> > > > > > > > > This library tries to promote itself, there is a hype
> around
> > > it
> > > > in
> > > > > > ML
> > > > > > > > > sphere, and  it was integrated with Apache Spark, to
> provide
> > > > > scalable
> > > > > > > > > deeplearning calculations.
> > > > > > > > > That's why I thought: could we integrate with this library
> or
> > > not
> > > > > > also
> > > > > > > > and
> > > > > > > > > Flink?
> > > > > > > > > 1) Personally I think, providing support and deployment of
> > > > > > Deeplearning
> > > > > > > > > algorithms/models in Flink is promising and attractive
> > feature,
> > > > > > > because:
> > > > > > > > >     a) during last two years deeplearning proved its
> > efficiency
> > > > and
> > > > > > > this
> > > > > > > > > algorithms used in many applications. For example *Spotify
> > > *uses
> > > > DL
> > > > > > > based
> > > > > > > > > algorithms for music content extraction: Recommending music
> > on
> > > > > > Spotify
> > > > > > > > > with deep learning AUGUST 05, 2014
> > > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html>
> for
> > > > their
> > > > > > > music
> > > > > > > > > recommendations. Doing this natively scalable is very
> > > attractive.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I have investigated that implementation of integration DL4J
> > > with
> > > > > > Apache
> > > > > > > > > Spark, and got several points:
> > > > > > > > >
> > > > > > > > > 1) It seems that idea of building of our own implementation
> > of
> > > > > > word2vec
> > > > > > > > > not such a bad solution, because the integration of DL4J
> with
> > > > Spark
> > > > > > is
> > > > > > > > too
> > > > > > > > strongly coupled with Spark API and it will take time from
> > the
> > > > side
> > > > > > of
> > > > > > > > DL4J
> > > > > > > > > to adopt this integration to Flink. Also I have expected
> that
> > > we
> > > > > will
> > > > > > > be
> > > > > > > > > able to call just some API, it is not such thing.
> > > > > > > > > 2)
> > > > > > > > >
> > > > > > > > > https://deeplearning4j.org/use_cases
> > > > > > > > > https://www.analyticsvidhya.com/blog/2017/01/t-sne-
> > > > > > > > implementation-r-python/
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > чт, 19 янв. 2017 г. в 13:29, Till Rohrmann <
> > > trohrmann@apache.org
> > > > >:
> > > > > > > > >
> > > > > > > > > Hi Katherin,
> > > > > > > > >
> > > > > > > > > welcome to the Flink community. Always great to see new
> > people
> > > > > > joining
> > > > > > > > the
> > > > > > > > > community :-)
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
> > > > > > > > >
> > > > > > > > > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
> > > > > > > > katherinmail@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > ok, I've got it.
> > > > > > > > > > I will take a look at
> > > > https://github.com/apache/flink/pull/2735
> > > > > .
> > > > > > > > > >
> > > > > > > > > > вт, 17 янв. 2017 г. в 14:36, Theodore Vasiloudis <
> > > > > > > > > > theodoros.vasiloudis@gmail.com>:
> > > > > > > > > >
> > > > > > > > > > > Hello Katherin,
> > > > > > > > > > >
> > > > > > > > > > > Welcome to the Flink community!
> > > > > > > > > > >
> > > > > > > > > > > The ML component definitely needs a lot of work you are
> > > > > correct,
> > > > > > we
> > > > > > > > are
> > > > > > > > > > > facing similar problems to CEP, which we'll hopefully
> > > resolve
> > > > > > with
> > > > > > > > the
> > > > > > > > > > > restructuring Stephan has mentioned in that thread.
> > > > > > > > > > >
> > > > > > > > > > > If you'd like to help out with PRs we have many open,
> > one I
> > > > > have
> > > > > > > > > started
> > > > > > > > > > > reviewing but got side-tracked is the Word2Vec one [1].
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Theodore
> > > > > > > > > > >
> > > > > > > > > > > [1] https://github.com/apache/flink/pull/2735
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <
> > > > > > fhueske@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Katherin,
> > > > > > > > > > > >
> > > > > > > > > > > > welcome to the Flink community!
> > > > > > > > > > > > Help with reviewing PRs is always very welcome and a
> > > great
> > > > > way
> > > > > > to
> > > > > > > > > > > > contribute.
> > > > > > > > > > > >
> > > > > > > > > > > > Best, Fabian
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> > > > > > > > katherinmail@gmail.com
> > > > > > > > > >:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thank you, Timo.
> > > > > > > > > > > > > I have started the analysis of the topic.
> > > > > > > > > > > > > And if it necessary, I will try to perform the
> review
> > > of
> > > > > > other
> > > > > > > > > pulls)
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > вт, 17 янв. 2017 г. в 13:09, Timo Walther <
> > > > > > twalthr@apache.org
> > > > > > > >:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Katherin,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > great to hear that you would like to contribute!
> > > > Welcome!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I gave you contributor permissions. You can now
> > > assign
> > > > > > issues
> > > > > > > > to
> > > > > > > > > > > > > > yourself. I assigned FLINK-1750 to you.
> > > > > > > > > > > > > > Right now there are many open ML pull requests,
> you
> > > are
> > > > > > very
> > > > > > > > > > welcome
> > > > > > > > > > > to
> > > > > > > > > > > > > > review the code of others, too.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Timo
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> > > > > > > > > > > > > > > Hello, All!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'm Kate Eri, I'm java developer with 6-year
> > > > enterprise
> > > > > > > > > > experience,
> > > > > > > > > > > > > also
> > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > have some expertise with scala (half of the
> > year).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Last 2 years I have participated in several
> > BigData
> > > > > > > projects
> > > > > > > > > that
> > > > > > > > > > > > were
> > > > > > > > > > > > > > > related to Machine Learning (Time series
> > analysis,
> > > > > > > > Recommender
> > > > > > > > > > > > systems,
> > > > > > > > > > > > > > > Social networking) and ETL. I have experience
> > with
> > > > > > Hadoop,
> > > > > > > > > Apache
> > > > > > > > > > > > Spark
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > Hive.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I’m fond of ML topic, and I see that Flink
> > project
> > > > > > requires
> > > > > > > > > some
> > > > > > > > > > > work
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > > this area, so that’s why I would like to join
> > Flink
> > > > and
> > > > > > ask
> > > > > > > > me
> > > > > > > > > to
> > > > > > > > > > > > grant
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > assignment of the ticket
> > > > > > > > > > > > > > https://issues.apache.org/jira/browse/FLINK-1750
> > > > > > > > > > > > > > > to me.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: New Flink team member - Kate Eri.

Posted by Katherin Eri <ka...@gmail.com>.
Thank you Felix, for provided information.

Currently I am analyzing the provided integration of Flink with SystemML.

I am also gathering information for the ticket FLINK-1730
<https://issues.apache.org/jira/browse/FLINK-1730>; maybe we will take it
on, to unblock the SystemML/Flink integration.



чт, 9 февр. 2017 г. в 0:17, Felix Neutatz <ne...@googlemail.com.invalid>:

> Hi Kate,
>
> 1) - Broadcast:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
>  - Caching: https://issues.apache.org/jira/browse/FLINK-1730
>
> 2) I have no idea about the GPU implementation. The SystemML mailing list
> will probably help you out there.
>
> Best regards,
> Felix
>
> 2017-02-08 14:33 GMT+01:00 Katherin Eri <ka...@gmail.com>:
>
> > Thank you Felix, for your point, it is quite interesting.
> >
> > I will take a look at the code, of the provided Flink integration.
> >
> > 1)    You have these problems with Flink: >>we realized that the lack of
> a
> > caching operator and a broadcast issue highly affects the performance,
> have
> > you already asked about this the community? In case yes: please provide
> the
> > reference to the ticket or the topic of letter.
> >
> > 2)    You have said, that SystemML provides GPU support. I have seen
> > SystemML’s source code and would like to ask: why you have decided to
> > implement your own integration with cuda? Did you try to consider ND4J,
> or
> > because it is younger, you support your own implementation?
> >
> > вт, 7 февр. 2017 г. в 18:35, Felix Neutatz <ne...@googlemail.com>:
> >
> > > Hi Katherin,
> > >
> > > we are also working in a similar direction. We implemented a prototype
> to
> > > integrate with SystemML:
> > > https://github.com/apache/incubator-systemml/pull/119
> > > SystemML provides many different matrix formats, operations, GPU
> support
> > > and a couple of DL algorithms. Unfortunately, we realized that the lack
> > of
> > > a caching operator and a broadcast issue highly affects the performance
> > > (e.g. compared to Spark). At the moment I am trying to tackle the
> > broadcast
> > > issue. But caching is still a problem for us.
> > >
> > > Best regards,
> > > Felix
> > >
> > > 2017-02-07 16:22 GMT+01:00 Katherin Eri <ka...@gmail.com>:
> > >
> > > > Thank you, Till.
> > > >
> > > > 1)      Regarding ND4J, I didn’t know about such a pity and critical
> > > > restriction of it -> lack of sparsity optimizations, and you are
> right:
> > > > this issue is still actual for them. I saw that Flink uses Breeze,
> but
> > I
> > > > thought its usage caused by some historical reasons.
> > > >
> > > > 2)      Regarding integration with DL4J, I have read the source code
> of
> > > > DL4J/Spark integration, that’s why I have declined my idea of reuse
> of
> > > > their word2vec implementation for now, for example. I can perform
> > deeper
> > > > investigation of this topic, if it required.
> > > >
> > > >
> > > >
> > > > So I feel that we have the following picture:
> > > >
> > > > 1)      DL integration investigation, could be part of Apache Bahir.
> I
> > > can
> > > > perform further investigation of this topic, but I think we need a
> > > > separate ticket to track this activity.
> > > >
> > > > 2)      GPU support, required for DL is interesting, but requires
> ND4J
> > > for
> > > > example.
> > > >
> > > > 3)      ND4J couldn’t be incorporated because it doesn’t support
> > sparsity
> > > > <https://deeplearning4j.org/roadmap.html> [1].
> > > >
> > > > Regarding ND4J is this the single blocker for incorporation of it or
> > may
> > > be
> > > > some others known?
> > > >
> > > >
> > > > [1] https://deeplearning4j.org/roadmap.html
> > > >
> > > > вт, 7 февр. 2017 г. в 16:26, Till Rohrmann <tr...@apache.org>:
> > > >
> > > > Thanks for initiating this discussion Katherin. I think you're right
> > that
> > > > in general it does not make sense to reinvent the wheel over and over
> > > > again. Especially if you only have limited resources at hand. So if
> we
> > > > could integrate Flink with some existing library that would be great.
> > > >
> > > > In the past, however, we couldn't find a good library which provided
> > > enough
> > > > freedom to integrate it with Flink. Especially if you want to have
> > > > distributed and somewhat high-performance implementations of ML
> > > algorithms
> > > > you would have to take Flink's execution model (capabilities as well
> as
> > > > limitations) into account. That is mainly the reason why we started
> > > > implementing some of the algorithms "natively" on Flink.
> > > >
> > > > If I remember correctly, then the problem with ND4J was and still is
> > that
> > > > it does not support sparse matrices which was a requirement from our
> > > side.
> > > > As far as I know, it is quite common that you have sparse data
> > structures
> > > > when dealing with large scale problems. That's why we built our own
> > > > abstraction which can have different implementations. Currently, the
> > > > default implementation uses Breeze.
> > > >
> > > > I think the support for GPU based operations and the actual resource
> > > > management are two orthogonal things. The implementation would have
> to
> > > work
> > > > with no GPUs available anyway. If the system detects that GPUs are
> > > > available, then ideally it would exploit them. Thus, we could add
> this
> > > > feature later and maybe integrate it with FLINK-5131 [1].
> > > >
> > > > Concerning the integration with DL4J I think that Theo's proposal to
> do
> > > it
> > > > in a separate repository (maybe as part of Apache Bahir) is a good
> > idea.
> > > > We're currently thinking about outsourcing some of Flink's libraries
> > into
> > > > sub projects. This could also be an option for the DL4J integration
> > then.
> > > > In general I think it should be feasible to run DL4J on Flink given
> > that
> > > it
> > > > also runs on Spark. Have you already looked at it closer?
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-5131
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri <
> katherinmail@gmail.com>
> > > > wrote:
> > > >
> > > > > Thank you Theodore, for your reply.
> > > > >
> > > > > 1)    Regarding GPU, your point is clear and I agree with it, ND4J
> > > looks
> > > > > appropriate. But, my current understanding is that, we also need to
> > > cover
> > > > > some resource management questions -> when we need to provide GPU
> > > support
> > > > > we also need to manage it like resource. For example, Mesos has
> > already
> > > > > supported GPU like resource item: Initial support for GPU
> resources.
> > > > > <
> https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU
> > >
> > > > > Flink
> > > > > uses Mesos as cluster manager, and this means that this feature of
> > > Mesos
> > > > > could be reused. Also memory managing questions in Flink regarding
> > GPU
> > > > > should be clarified.
> > > > >
> > > > > 2)    Regarding integration with DL4J: what stops us to initialize
> > > ticket
> > > > > and start the discussion around this topic? We need some user story
> > or
> > > > the
> > > > > community is not sure that DL is really helpful? Why did the discussion
> > > with
> > > > > Adam
> > > > > Gibson end without implementing anything? What
> > concerns
> > > do
> > > > > we have?
> > > > >
> > > > > Mon, 6 Feb 2017 at 15:01, Theodore Vasiloudis <
> > > > > theodoros.vasiloudis@gmail.com>:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > This is a point that has come up in the past: Given the multitude
> of
> > ML
> > > > > > libraries out there, should we have native implementations in
> > FlinkML
> > > > or
> > > > > > try to integrate other libraries instead?
> > > > > >
> > > > > > We haven't managed to reach a consensus on this before. My
> opinion
> > is
> > > > > that
> > > > > > there is definitely value in having ML algorithms written
> natively
> > in
> > > > > > Flink, both for performance optimization,
> > > > > > but more importantly for engineering simplicity, we don't want to
> > > force
> > > > > > users to use yet another piece of software to run their ML algos
> > (at
> > > > > least
> > > > > > for a basic set of algorithms).
> > > > > >
> > > > > > We have in the past  discussed integrations with DL4J
> (particularly
> > > > ND4J)
> > > > > > with Adam Gibson, the core developer of the library, but we never
> > got
> > > > > > around to implementing anything.
> > > > > >
> > > > > > Whether it makes sense to have an integration with DL4J as part
> of
> > > the
> > > > > > Flink distribution would be up for discussion. I would suggest to
> > > make
> > > > it
> > > > > > an independent repo to allow for
> > > > > > faster dev/release cycles, and because it wouldn't be directly
> > > related
> > > > to
> > > > > > the core of Flink so it would add extra reviewing burden to an
> > > already
> > > > > > overloaded group of committers.
> > > > > >
> > > > > > Natively supporting GPU calculations in Flink would be much
> better
> > > > > achieved
> > > > > > through a library like ND4J, the engineering burden would be too
> > much
> > > > > > otherwise.
> > > > > >
> > > > > > Regards,
> > > > > > Theodore
> > > > > >
> > > > > > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <
> > > katherinmail@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello, guys.
> > > > > > >
> > > > > > > Theodore, last week I started the review of the PR:
> > > > > > > https://github.com/apache/flink/pull/2735 related to *word2Vec
> > for
> > > > > > Flink*.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > During this review I have asked myself: why do we need to
> > implement
> > > > > such
> > > > > > a
> > > > > > > very popular algorithm like *word2vec one more time*, when
> there
> > is
> > > > > > already
> > > > > > > an available implementation in Java provided by
> deeplearning4j.org
> > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache
> 2
> > > > > > licence).
> > > > > > > This library tries to promote itself; there is hype around it
> > in
> > > the ML
> > > > > > > sphere, and it was integrated with Apache Spark, to provide
> > > scalable
> > > > > > > deeplearning calculations.
> > > > > > >
> > > > > > >
> > > > > > > *That's why I thought: could we integrate with this library or
> > not
> > > > also
> > > > > > and
> > > > > > > Flink? *
> > > > > > >
> > > > > > > 1) Personally, I think providing support and deployment of
> > > > > > > *Deeplearning(DL)
> > > > > > > algorithms/models in Flink* is a promising and attractive
> feature,
> > > > > because:
> > > > > > >
> > > > > > >     a) during the last two years DL proved its efficiency and these
> > > > > > algorithms
> > > > > > > are used in many applications. For example *Spotify *uses DL based
> > > > > algorithms
> > > > > > > for music content extraction: Recommending music on Spotify
> with
> > > deep
> > > > > > > learning AUGUST 05, 2014
> > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for
> > their
> > > > > music
> > > > > > > recommendations. Developers need to scale up DL manually, which
> > > causes
> > > > a
> > > > > > lot
> > > > > > > of work; that is why platforms like Flink should support
> > > these
> > > > > > > models' deployment.
> > > > > > >
> > > > > > >     b) Here is presented the scope of Deeplearning use cases
> > > > > > > <https://deeplearning4j.org/use_cases>; many of these
> > scenarios
> > > > > > could
> > > > > > > be supported on Flink.
> > > > > > >
> > > > > > >
> > > > > > > 2) But DL raises such questions as:
> > > > > > >
> > > > > > >     a) how to scale calculations across machines
> > > > > > >
> > > > > > >     b) how to perform these calculations on both CPU and GPU. GPU is
> > > > > required
> > > > > > to
> > > > > > > train big DL models, otherwise learning process could have very
> > > slow
> > > > > > > convergence.
> > > > > > >
> > > > > > >
> > > > > > > 3) I have checked this DL4J library, which already has rich
> > > support
> > > > > of
> > > > > > > many attractive DL models like: Recurrent Networks and LSTMs,
> > > > > > Convolutional
> > > > > > > Networks (CNN), Restricted Boltzmann Machines (RBM) and others.
> > So
> > > we
> > > > > > won’t
> > > > > > > need to implement them independently, but only provide the
> > ability
> > > of
> > > > > > > executing these models over a Flink cluster, in quite a similar
> > way
> > > > to
> > > > > > how
> > > > > > > it was integrated with Apache Spark.
> > > > > > >
> > > > > > >
> > > > > > > Because of all of this I propose:
> > > > > > >
> > > > > > > 1)    To create new ticket in Flink’s JIRA for integration of
> > Flink
> > > > > with
> > > > > > > DL4J and decide on which side this integration should be
> > > implemented.
> > > > > > >
> > > > > > > 2)    Support natively GPU resources in Flink and allow
> > > calculations
> > > > > over
> > > > > > > them, like that is described in this publication
> > > > > > > https://www.oreilly.com/learning/accelerating-spark-
> > > > > workloads-using-gpus
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > *Regarding original issue Implement Word2Vec
> > > > > > > <https://issues.apache.org/jira/browse/FLINK-2094>in Flink,
> *I
> > > have
> > > > > > > investigated its implementation in DL4J and the
> implementation
> > of
> > > > > > > the DL4J integration with Apache Spark, and got several points:
> > > > > > >
> > > > > > > It seems that the idea of building our own implementation of
> > > word2vec
> > > > in
> > > > > > > Flink is not such a bad solution, because DL4J was forced to
> > > > > > reimplement
> > > > > > > its original word2Vec over Spark. I have checked the
> integration
> > of
> > > > > DL4J
> > > > > > > with Spark, and found that it is too strongly coupled with
> Spark
> > > API,
> > > > > so
> > > > > > > that it is impossible just to take some DL4J API and reuse it,
> > > > instead
> > > > > we
> > > > > > > need to implement independent integration for Flink.
> > > > > > >
> > > > > > > *That’s why we should simply finish the implementation of the
> > > > > > > current PR **independently **from
> > > > > > > the DL4J integration.*
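For reference, the core of what the PR has to implement, skip-gram with negative sampling, reduces to repeated logistic-regression updates on (word, context) pairs. A deliberately tiny, single-threaded sketch of exactly that update (illustrative only; this is not the PR's code and nothing here is Flink-specific):

```java
import java.util.*;

/** Toy skip-gram word2vec with negative sampling (single-threaded illustration only). */
public class TinyWord2Vec {

    /** Trains input vectors for each distinct token and returns them. */
    public static Map<String, double[]> train(List<String> tokens, int dim, int window,
                                              int negatives, int epochs, double lr, long seed) {
        List<String> vocab = new ArrayList<>(new LinkedHashSet<>(tokens));
        Random rnd = new Random(seed);
        Map<String, double[]> in = new HashMap<>();   // "input" (word) vectors
        Map<String, double[]> out = new HashMap<>();  // "output" (context) vectors
        for (String w : vocab) {
            double[] v = new double[dim];
            for (int k = 0; k < dim; k++) v[k] = (rnd.nextDouble() - 0.5) / dim;
            in.put(w, v);
            out.put(w, new double[dim]);
        }
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < tokens.size(); i++) {
                double[] center = in.get(tokens.get(i));
                int lo = Math.max(0, i - window), hi = Math.min(tokens.size() - 1, i + window);
                for (int j = lo; j <= hi; j++) {
                    if (j == i) continue;
                    sgdStep(center, out.get(tokens.get(j)), 1.0, lr);   // positive pair
                    for (int n = 0; n < negatives; n++) {
                        // crude uniform negative sampling (may rarely hit the true context)
                        sgdStep(center, out.get(vocab.get(rnd.nextInt(vocab.size()))), 0.0, lr);
                    }
                }
            }
        }
        return in;
    }

    /** One logistic-loss gradient step on a (word, context) pair with the given label. */
    private static void sgdStep(double[] v, double[] u, double label, double lr) {
        double dot = 0;
        for (int k = 0; k < v.length; k++) dot += v[k] * u[k];
        double g = lr * (label - 1.0 / (1.0 + Math.exp(-dot)));
        for (int k = 0; k < v.length; k++) {
            double vk = v[k];
            v[k] += g * u[k];
            u[k] += g * vk;
        }
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "flink", "processes", "streams", "flink", "processes", "batches");
        Map<String, double[]> vecs = train(corpus, 8, 2, 3, 50, 0.05, 42L);
        System.out.println(vecs.size() + " vectors of dim " + vecs.get("flink").length);
    }
}
```

A distributed version has to decide how these per-pair updates are aggregated across workers, which is exactly where the tight Spark coupling in DL4J's implementation comes from.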
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Could you please provide your opinion regarding my questions
> and
> > > > > points,
> > > > > > > what do you think about them?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Mon, 6 Feb 2017 at 12:51, Katherin Eri <
> > katherinmail@gmail.com
> > > >:
> > > > > > >
> > > > > > > > Sorry, guys I need to finish this letter first.
> > > > > > > >   Full version of it will come shortly.
> > > > > > > >
> > > > > > > > Mon, 6 Feb 2017 at 12:49, Katherin Eri <
> > > katherinmail@gmail.com
> > > > >:
> > > > > > > >
> > > > > > > > Hello, guys.
> > > > > > > > Theodore, last week I started the review of the PR:
> > > > > > > > https://github.com/apache/flink/pull/2735 related to
> *word2Vec
> > > for
> > > > > > > Flink*.
> > > > > > > >
> > > > > > > > During this review I have asked myself: why do we need to
> > > implement
> > > > > > such
> > > > > > > a
> > > > > > > > very popular algorithm like *word2vec one more time*, when
> > there
> > > is
> > > > > > > > an already available implementation in Java provided by
> > > > > deeplearning4j.org
> > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
> Apache
> > 2
> > > > > > > licence).
> > > > > > > > This library tries to promote itself; there is hype around
> > it
> > > in
> > > > > ML
> > > > > > > > sphere, and  it was integrated with Apache Spark, to provide
> > > > scalable
> > > > > > > > deeplearning calculations.
> > > > > > > > That's why I thought: could we integrate with this library or
> > not
> > > > > also
> > > > > > > and
> > > > > > > > Flink?
> > > > > > > > 1) Personally, I think providing support and deployment of
> > > > > Deeplearning
> > > > > > > > algorithms/models in Flink is a promising and attractive
> feature,
> > > > > > because:
> > > > > > > >     a) during the last two years deeplearning proved its
> efficiency
> > > and
> > > > > > these
> > > > > > > > algorithms are used in many applications. For example *Spotify
> > *uses
> > > DL
> > > > > > based
> > > > > > > > algorithms for music content extraction: Recommending music
> on
> > > > > Spotify
> > > > > > > > with deep learning AUGUST 05, 2014
> > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for
> > > their
> > > > > > music
> > > > > > > > recommendations. Making this natively scalable is very
> > attractive.
> > > > > > > >
> > > > > > > >
> > > > > > > > I have investigated the implementation of integrating DL4J
> > with
> > > > > Apache
> > > > > > > > Spark, and got several points:
> > > > > > > >
> > > > > > > > 1) It seems that the idea of building our own implementation
> of
> > > > > word2vec
> > > > > > > > is not such a bad solution, because the integration of DL4J with
> > > Spark
> > > > > is
> > > > > > > too
> > > > > > > > strongly coupled with Spark API and it will take time from
> the
> > > side
> > > > > of
> > > > > > > DL4J
> > > > > > > > to adopt this integration to Flink. Also, I had expected that
> > we
> > > > would
> > > > > > be
> > > > > > > > able to simply call some API, but that is not the case.
> > > > > > > > 2)
> > > > > > > >
> > > > > > > > https://deeplearning4j.org/use_cases
> > > > > > > > https://www.analyticsvidhya.com/blog/2017/01/t-sne-
> > > > > > > implementation-r-python/
> > > > > > > >
> > > > > > > >
> > > > > > > > Thu, 19 Jan 2017 at 13:29, Till Rohrmann <
> > trohrmann@apache.org
> > > >:
> > > > > > > >
> > > > > > > > Hi Katherin,
> > > > > > > >
> > > > > > > > welcome to the Flink community. Always great to see new
> people
> > > > > joining
> > > > > > > the
> > > > > > > > community :-)
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Till
> > > > > > > >
> > > > > > > > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
> > > > > > > katherinmail@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > ok, I've got it.
> > > > > > > > > I will take a look at
> > > https://github.com/apache/flink/pull/2735
> > > > .
> > > > > > > > >
> > > > > > > > > Tue, 17 Jan 2017 at 14:36, Theodore Vasiloudis <
> > > > > > > > > theodoros.vasiloudis@gmail.com>:
> > > > > > > > >
> > > > > > > > > > Hello Katherin,
> > > > > > > > > >
> > > > > > > > > > Welcome to the Flink community!
> > > > > > > > > >
> > > > > > > > > > The ML component definitely needs a lot of work you are
> > > > correct,
> > > > > we
> > > > > > > are
> > > > > > > > > > facing similar problems to CEP, which we'll hopefully
> > resolve
> > > > > with
> > > > > > > the
> > > > > > > > > > restructuring Stephan has mentioned in that thread.
> > > > > > > > > >
> > > > > > > > > > If you'd like to help out with PRs we have many open,
> one I
> > > > have
> > > > > > > > started
> > > > > > > > > > reviewing but got side-tracked is the Word2Vec one [1].
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Theodore
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/apache/flink/pull/2735
> > > > > > > > > >
> > > > > > > > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <
> > > > > fhueske@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Katherin,
> > > > > > > > > > >
> > > > > > > > > > > welcome to the Flink community!
> > > > > > > > > > > Help with reviewing PRs is always very welcome and a
> > great
> > > > way
> > > > > to
> > > > > > > > > > > contribute.
> > > > > > > > > > >
> > > > > > > > > > > Best, Fabian
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> > > > > > > katherinmail@gmail.com
> > > > > > > > >:
> > > > > > > > > > >
> > > > > > > > > > > > Thank you, Timo.
> > > > > > > > > > > > I have started the analysis of the topic.
> > > > > > > > > > > > And if it necessary, I will try to perform the review
> > of
> > > > > other
> > > > > > > > pulls)
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Tue, 17 Jan 2017 at 13:09, Timo Walther <
> > > > > twalthr@apache.org
> > > > > > >:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Katherin,
> > > > > > > > > > > > >
> > > > > > > > > > > > > great to hear that you would like to contribute!
> > > Welcome!
> > > > > > > > > > > > >
> > > > > > > > > > > > > I gave you contributor permissions. You can now
> > assign
> > > > > issues
> > > > > > > to
> > > > > > > > > > > > > yourself. I assigned FLINK-1750 to you.
> > > > > > > > > > > > > Right now there are many open ML pull requests, you
> > are
> > > > > very
> > > > > > > > > welcome
> > > > > > > > > > to
> > > > > > > > > > > > > review the code of others, too.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Timo
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 17/01/17 at 10:39, Katherin Sotenko wrote:
> > > > > > > > > > > > > > Hello, All!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm Kate Eri, I'm java developer with 6-year
> > > enterprise
> > > > > > > > > experience,
> > > > > > > > > > > > also
> > > > > > > > > > > > > I
> > > > > > > > > > > > > > have some expertise with scala (half of the
> year).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Last 2 years I have participated in several
> BigData
> > > > > > projects
> > > > > > > > that
> > > > > > > > > > > were
> > > > > > > > > > > > > > related to Machine Learning (Time series
> analysis,
> > > > > > > Recommender
> > > > > > > > > > > systems,
> > > > > > > > > > > > > > Social networking) and ETL. I have experience
> with
> > > > > Hadoop,
> > > > > > > > Apache
> > > > > > > > > > > Spark
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > Hive.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I’m fond of ML topic, and I see that Flink
> project
> > > > > requires
> > > > > > > > some
> > > > > > > > > > work
> > > > > > > > > > > > in
> > > > > > > > > > > > > > this area, so that’s why I would like to join
> Flink
> > > and
> > > > > ask
> > > > > > > me
> > > > > > > > to
> > > > > > > > > > > grant
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > assignment of the ticket
> > > > > > > > > > > > > https://issues.apache.org/jira/browse/FLINK-1750
> > > > > > > > > > > > > > to me.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: New Flink team member - Kate Eri.

Posted by Felix Neutatz <ne...@googlemail.com.INVALID>.
Hi Kate,

1) - Broadcast:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
 - Caching: https://issues.apache.org/jira/browse/FLINK-1730

2) I have no idea about the GPU implementation. The SystemML mailing list
will probably help you out there.

Best regards,
Felix
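The broadcast issue behind the FLIP-5 link is easy to quantify: a broadcast set is currently shipped once per parallel task instead of once per TaskManager, so with several slots per machine the same bytes cross the network several times. The numbers below are illustrative only:

```java
/** Illustrative arithmetic for the FLIP-5 broadcast issue:
 *  data shipped per parallel task vs once per TaskManager. */
public class BroadcastCost {

    static long perTaskBytes(long broadcastBytes, int taskManagers, int slotsPerTm) {
        return broadcastBytes * taskManagers * slotsPerTm;   // every task gets its own copy
    }

    static long perTaskManagerBytes(long broadcastBytes, int taskManagers) {
        return broadcastBytes * taskManagers;                // one copy per machine, shared by its slots
    }

    public static void main(String[] args) {
        long model = 100L * 1024 * 1024;                     // a 100 MiB broadcast set
        int tms = 10, slots = 8;
        System.out.println("per task:        " + perTaskBytes(model, tms, slots) / (1024 * 1024) + " MiB");
        System.out.println("per TaskManager: " + perTaskManagerBytes(model, tms) / (1024 * 1024) + " MiB");
    }
}
```

With 10 machines of 8 slots each, an 8x difference in network traffic, which is why the lack of this optimization (and of a caching operator) hurts iterative ML workloads so much.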

2017-02-08 14:33 GMT+01:00 Katherin Eri <ka...@gmail.com>:

> Thank you Felix, for your point, it is quite interesting.
>
> I will take a look at the code of the provided Flink integration.
>
> 1)    You have these problems with Flink: >>we realized that the lack of a
> caching operator and a broadcast issue highly affects the performance. Have
> you already asked the community about this? If yes, please provide the
> reference to the ticket or the subject of the letter.
>
> 2)    You have said that SystemML provides GPU support. I have seen
> SystemML’s source code and would like to ask: why have you decided to
> implement your own integration with CUDA? Did you consider ND4J, or
> do you maintain your own implementation because ND4J is younger?
>
> Tue, 7 Feb 2017 at 18:35, Felix Neutatz <ne...@googlemail.com>:
>
> > Hi Katherin,
> >
> > we are also working in a similar direction. We implemented a prototype to
> > integrate with SystemML:
> > https://github.com/apache/incubator-systemml/pull/119
> > SystemML provides many different matrix formats, operations, GPU support
> > and a couple of DL algorithms. Unfortunately, we realized that the lack
> of
> > a caching operator and a broadcast issue highly effects the performance
> > (e.g. compared to Spark). At the moment I am trying to tackle the
> broadcast
> > issue. But caching is still a problem for us.
> >
> > Best regards,
> > Felix
> >
> > 2017-02-07 16:22 GMT+01:00 Katherin Eri <ka...@gmail.com>:
> >
> > > Thank you, Till.
> > >
> > > 1)      Regarding ND4J, I didn’t know about such an unfortunate and critical
> > > restriction of it -> lack of sparsity optimizations, and you are right:
> > > this issue is still open for them. I saw that Flink uses Breeze, but
> I
> > > thought its usage was caused by some historical reasons.
> > >
> > > 2)      Regarding integration with DL4J, I have read the source code of
> > > the DL4J/Spark integration; that’s why I have dropped the idea of reusing
> > > their word2vec implementation for now. I can perform
> deeper
> > > investigation of this topic, if required.
> > >
> > >
> > >
> > > So I feel that we have the following picture:
> > >
> > > 1)      DL integration investigation could be part of Apache Bahir. I
> > can
> > > perform further investigation of this topic, but I think we need a
> > > separate ticket to track this activity.
> > >
> > > 2)      GPU support, required for DL, is interesting but requires ND4J
> > for
> > > example.
> > >
> > > 3)      ND4J couldn’t be incorporated because it doesn’t support
> sparsity
> > > <https://deeplearning4j.org/roadmap.html> [1].
> > >
> > > Regarding ND4J: is this the single blocker for incorporating it, or
> are
> > there
> > > other known ones?
> > >
> > >
> > > [1] https://deeplearning4j.org/roadmap.html
> > >
> > > Tue, 7 Feb 2017 at 16:26, Till Rohrmann <tr...@apache.org>:
> > >
> > > Thanks for initiating this discussion Katherin. I think you're right
> that
> > > in general it does not make sense to reinvent the wheel over and over
> > > again. Especially if you only have limited resources at hand. So if we
> > > could integrate Flink with some existing library that would be great.
> > >
> > > In the past, however, we couldn't find a good library which provided
> > enough
> > > freedom to integrate it with Flink. Especially if you want to have
> > > distributed and somewhat high-performance implementations of ML
> > algorithms
> > > you would have to take Flink's execution model (capabilities as well as
> > > limitations) into account. That is mainly the reason why we started
> > > implementing some of the algorithms "natively" on Flink.
> > >
> > > If I remember correctly, then the problem with ND4J was and still is
> that
> > > it does not support sparse matrices which was a requirement from our
> > side.
> > > As far as I know, it is quite common that you have sparse data
> structures
> > > when dealing with large scale problems. That's why we built our own
> > > abstraction which can have different implementations. Currently, the
> > > default implementation uses Breeze.
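To illustrate why the sparsity requirement matters: with bag-of-words features most coordinates are zero, and a dense backend stores and multiplies every one of them. A toy map-based sketch of the idea only (Flink ML's real abstraction delegates to Breeze, not to this):

```java
import java.util.*;

/** Minimal sparse vector: stores only non-zero entries, so operations
 *  cost O(non-zeros) instead of O(dimension). Illustration only. */
public class SparseVector {
    final int dim;
    final Map<Integer, Double> entries = new HashMap<>();   // index -> non-zero value

    SparseVector(int dim) { this.dim = dim; }

    void set(int i, double v) {
        if (v != 0.0) entries.put(i, v); else entries.remove(i);
    }

    /** Dot product iterating only over this vector's non-zeros. */
    double dot(SparseVector other) {
        double s = 0;
        for (Map.Entry<Integer, Double> e : entries.entrySet()) {
            Double ov = other.entries.get(e.getKey());
            if (ov != null) s += e.getValue() * ov;
        }
        return s;
    }

    public static void main(String[] args) {
        // Two 1,000,000-dimensional vectors with 3 and 2 non-zeros respectively.
        SparseVector a = new SparseVector(1_000_000);
        a.set(3, 1.5); a.set(70_000, 2.0); a.set(999_999, -1.0);
        SparseVector b = new SparseVector(1_000_000);
        b.set(70_000, 4.0); b.set(5, 1.0);
        System.out.println(a.dot(b));   // only the shared index 70000 contributes
    }
}
```

A dense backend would allocate a million doubles per vector here and multiply a million pairs per dot product; the sparse one touches three entries. That gap is why a dense-only library is a hard fit for large-scale text features.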
> > >
> > > I think the support for GPU based operations and the actual resource
> > > management are two orthogonal things. The implementation would have to
> > work
> > > with no GPUs available anyway. If the system detects that GPUs are
> > > available, then ideally it would exploit them. Thus, we could add this
> > > feature later and maybe integrate it with FLINK-5131 [1].
> > >
> > > Concerning the integration with DL4J I think that Theo's proposal to do
> > it
> > > in a separate repository (maybe as part of Apache Bahir) is a good
> idea.
> > > We're currently thinking about outsourcing some of Flink's libraries
> into
> > > sub projects. This could also be an option for the DL4J integration
> then.
> > > In general I think it should be feasible to run DL4J on Flink given
> that
> > it
> > > also runs on Spark. Have you already looked at it closer?
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-5131
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri <ka...@gmail.com>
> > > wrote:
> > >
> > > > Thank you Theodore, for your reply.
> > > >
> > > > 1)    Regarding GPU, your point is clear and I agree with it, ND4J
> > looks
> > > > appropriate. But my current understanding is that we also need to
> > cover
> > > > some resource management questions -> when we need to provide GPU
> > support
> > > > we also need to manage it as a resource. For example, Mesos has
> already
> > > > supported GPU as a resource: Initial support for GPU resources.
> > > > <https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU
> >
> > > > Flink
> > > > uses Mesos as cluster manager, and this means that this feature of
> > Mesos
> > > > could be reused. Also, memory management questions in Flink regarding
> GPU
> > > > should be clarified.
> > > >
> > > > 2)    Regarding integration with DL4J: what stops us from creating a
> > ticket
> > > > and start the discussion around this topic? We need some user story
> or
> > > the
> > > > community is not sure that DL is really helpful? Why the discussion
> > with
> > > > Adam
> > > > Gibson just finished with no implementation of any idea? What
> concerns
> > do
> > > > we have?
> > > >
> > > > Mon, 6 Feb 2017 at 15:01, Theodore Vasiloudis <
> > > > theodoros.vasiloudis@gmail.com>:
> > > >
> > > > > Hello all,
> > > > >
> > > > > This is a point that has come up in the past: Given the multitude of
> ML
> > > > > libraries out there, should we have native implementations in
> FlinkML
> > > or
> > > > > try to integrate other libraries instead?
> > > > >
> > > > > We haven't managed to reach a consensus on this before. My opinion
> is
> > > > that
> > > > > there is definitely value in having ML algorithms written natively
> in
> > > > > Flink, both for performance optimization,
> > > > > but more importantly for engineering simplicity, we don't want to
> > force
> > > > > users to use yet another piece of software to run their ML algos
> (at
> > > > least
> > > > > for a basic set of algorithms).
> > > > >
> > > > > We have in the past  discussed integrations with DL4J (particularly
> > > ND4J)
> > > > > with Adam Gibson, the core developer of the library, but we never
> got
> > > > > around to implementing anything.
> > > > >
> > > > > Whether it makes sense to have an integration with DL4J as part of
> > the
> > > > > Flink distribution would be up for discussion. I would suggest to
> > make
> > > it
> > > > > an independent repo to allow for
> > > > > faster dev/release cycles, and because it wouldn't be directly
> > related
> > > to
> > > > > the core of Flink so it would add extra reviewing burden to an
> > already
> > > > > overloaded group of committers.
> > > > >
> > > > > Natively supporting GPU calculations in Flink would be much better
> > > > achieved
> > > > > through a library like ND4J, the engineering burden would be too
> much
> > > > > otherwise.
> > > > >
> > > > > Regards,
> > > > > Theodore
> > > > >
> > > > > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <
> > katherinmail@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hello, guys.
> > > > > >
> > > > > > Theodore, last week I started the review of the PR:
> > > > > > https://github.com/apache/flink/pull/2735 related to *word2Vec
> for
> > > > > Flink*.
> > > > > >
> > > > > >
> > > > > >
> > > > > > During this review I have asked myself: why do we need to
> implement
> > > > such
> > > > > a
> > > > > > very popular algorithm like *word2vec one more time*, when there
> is
> > > > > already
> > > > > an available implementation in Java provided by deeplearning4j.org
> > > > > > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2
> > > > > licence).
> > > > > > This library tries to promote itself; there is hype around it
> in
> > the ML
> > > > > > sphere, and it was integrated with Apache Spark, to provide
> > scalable
> > > > > > deeplearning calculations.
> > > > > >
> > > > > >
> > > > > > *That's why I thought: could we integrate with this library or
> not
> > > also
> > > > > and
> > > > > > Flink? *
> > > > > >
> > > > > > 1) Personally, I think providing support and deployment of
> > > > > > *Deeplearning(DL)
> > > > > > algorithms/models in Flink* is a promising and attractive feature,
> > > > because:
> > > > > >
> > > > > >     a) during the last two years DL proved its efficiency and these
> > > > > algorithms
> > > > > > are used in many applications. For example *Spotify *uses DL based
> > > > algorithms
> > > > > > for music content extraction: Recommending music on Spotify with
> > deep
> > > > > > learning AUGUST 05, 2014
> > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for
> their
> > > > music
> > > > > > recommendations. Developers need to scale up DL manually, which
> > causes
> > > a
> > > > > lot
> > > > > > of work; that is why platforms like Flink should support
> > these
> > > > > > models' deployment.
> > > > > >
> > > > > >     b) Here is presented the scope of Deeplearning use cases
> > > > > > <https://deeplearning4j.org/use_cases>; many of these
> scenarios
> > > > > could
> > > > > > be supported on Flink.
> > > > > >
> > > > > >
> > > > > > 2) But DL raises such questions as:
> > > > > >
> > > > > >     a) how to scale calculations across machines
> > > > > >
> > > > > >     b) how to perform these calculations on both CPU and GPU. GPU is
> > > > required
> > > > > to
> > > > > > train big DL models, otherwise learning process could have very
> > slow
> > > > > > convergence.
> > > > > >
> > > > > >
> > > > > > 3) I have checked this DL4J library, which already has rich
> > support
> > > > of
> > > > > > many attractive DL models like: Recurrent Networks and LSTMs,
> > > > > Convolutional
> > > > > > Networks (CNN), Restricted Boltzmann Machines (RBM) and others.
> So
> > we
> > > > > won’t
> > > > > > need to implement them independently, but only provide the
> ability
> > of
> > > > > > executing these models over a Flink cluster, in quite a similar
> way
> > > to
> > > > > how
> > > > > > it was integrated with Apache Spark.
> > > > > >
> > > > > >
> > > > > > Because of all of this I propose:
> > > > > >
> > > > > > 1)    To create new ticket in Flink’s JIRA for integration of
> Flink
> > > > with
> > > > > > DL4J and decide on which side this integration should be
> > implemented.
> > > > > >
> > > > > > 2)    Support natively GPU resources in Flink and allow
> > calculations
> > > > over
> > > > > > them, like that is described in this publication
> > > > > > https://www.oreilly.com/learning/accelerating-spark-
> > > > workloads-using-gpus
> > > > > >
> > > > > >
> > > > > >
> > > > > > *Regarding original issue Implement Word2Vec
> > > > > > <https://issues.apache.org/jira/browse/FLINK-2094>in Flink,  *I
> > have
> > > > > > investigated its implementation in DL4J and the implementation
> of
> > > > > > the DL4J integration with Apache Spark, and got several points:
> > > > > >
> > > > > > It seems that the idea of building our own implementation of
> > word2vec
> > > in
> > > > > > Flink is not such a bad solution, because DL4J was forced to
> > > > > reimplement
> > > > > > its original word2Vec over Spark. I have checked the integration
> of
> > > > DL4J
> > > > > > with Spark, and found that it is too strongly coupled with Spark
> > API,
> > > > so
> > > > > > that it is impossible just to take some DL4J API and reuse it,
> > > instead
> > > > we
> > > > > > need to implement independent integration for Flink.
> > > > > >
> > > > > > *That’s why we should simply finish the implementation of the
> > > > > > current PR **independently **from
> > > > > > the DL4J integration.*
> > > > > >
> > > > > >
> > > > > >
> > > > > > Could you please provide your opinion regarding my questions and
> > > > points,
> > > > > > what do you think about them?
> > > > > >
> > > > > >
> > > > > >
> > > > > > Mon, 6 Feb 2017 at 12:51, Katherin Eri <
> katherinmail@gmail.com
> > >:
> > > > > >
> > > > > > > Sorry, guys I need to finish this letter first.
> > > > > > >   Full version of it will come shortly.
> > > > > > >
> > > > > > > Mon, 6 Feb 2017 at 12:49, Katherin Eri <
> > katherinmail@gmail.com
> > > >:
> > > > > > >
> > > > > > > Hello, guys.
> > > > > > > Theodore, last week I started the review of the PR:
> > > > > > > https://github.com/apache/flink/pull/2735 related to *word2Vec
> > for
> > > > > > Flink*.
> > > > > > >
> > > > > > > During this review I have asked myself: why do we need to
> > implement
> > > > > such
> > > > > > a
> > > > > > > very popular algorithm like *word2vec one more time*, when
> there
> > is
> > > > > > > an already available implementation in Java provided by
> > > > deeplearning4j.org
> > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache
> 2
> > > > > > licence).
> > > > > > > This library tries to promote itself; there is hype around
> it
> > in
> > > > ML
> > > > > > > sphere, and  it was integrated with Apache Spark, to provide
> > > scalable
> > > > > > > deeplearning calculations.
> > > > > > > That's why I thought: could we integrate with this library or
> not
> > > > also
> > > > > > and
> > > > > > > Flink?
> > > > > > 1) Personally, I think providing support and deployment of
> > > > Deeplearning
> > > > > > algorithms/models in Flink is a promising and attractive feature,
> > > > > because:
> > > > > >     a) during the last two years deeplearning proved its efficiency
> > and
> > > > > these
> > > > > > algorithms are used in many applications. For example *Spotify
> *uses
> > DL
> > > > > based
> > > > > > > algorithms for music content extraction: Recommending music on
> > > > Spotify
> > > > > > > with deep learning AUGUST 05, 2014
> > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for
> > their
> > > > > music
> > > > > > > recommendations. Making this natively scalable is very
> attractive.
> > > > > > >
> > > > > > >
> > > > > > > I have investigated the implementation of integrating DL4J
> with
> > > > Apache
> > > > > > > Spark, and got several points:
> > > > > > >
> > > > > > > 1) It seems that the idea of building our own implementation of
> > > > word2vec
> > > > > > > is not such a bad solution, because the integration of DL4J with
> > Spark
> > > > is
> > > > > > too
> > > > > > > strongly coupled with Spark API and it will take time from the
> > side
> > > > of
> > > > > > DL4J
> > > > > > > to adopt this integration to Flink. Also, I had expected that
> we
> > > would
> > > > be
> > > > > > > able to simply call some API, but that is not the case.
> > > > > > > 2)
> > > > > > >
> > > > > > > https://deeplearning4j.org/use_cases
> > > > > > > https://www.analyticsvidhya.com/blog/2017/01/t-sne-
> > > > > > implementation-r-python/
> > > > > > >
> > > > > > >
> > > > > > > Thu, Jan 19, 2017 at 13:29, Till Rohrmann <trohrmann@apache.org>:
> > > > > > >
> > > > > > > Hi Katherin,
> > > > > > >
> > > > > > > welcome to the Flink community. Always great to see new people
> > > > joining
> > > > > > the
> > > > > > > community :-)
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
> > > > > > katherinmail@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > ok, I've got it.
> > > > > > > > I will take a look at
> > https://github.com/apache/flink/pull/2735
> > > .
> > > > > > > >
> > > > > > > > Tue, Jan 17, 2017 at 14:36, Theodore Vasiloudis <theodoros.vasiloudis@gmail.com>:
> > > > > > > >
> > > > > > > > > Hello Katherin,
> > > > > > > > >
> > > > > > > > > Welcome to the Flink community!
> > > > > > > > >
> > > > > > > > > The ML component definitely needs a lot of work, you are
> > > > > > > > > correct; we are facing similar problems to CEP, which we'll
> > > > > > > > > hopefully resolve with the restructuring Stephan has
> > > > > > > > > mentioned in that thread.
> > > > > > > > >
> > > > > > > > > If you'd like to help out with PRs, we have many open; one
> > > > > > > > > I have started reviewing but got side-tracked on is the
> > > > > > > > > Word2Vec one [1].
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Theodore
> > > > > > > > >
> > > > > > > > > [1] https://github.com/apache/flink/pull/2735
> > > > > > > > >
> > > > > > > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <
> > > > fhueske@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Katherin,
> > > > > > > > > >
> > > > > > > > > > welcome to the Flink community!
> > > > > > > > > > Help with reviewing PRs is always very welcome and a
> great
> > > way
> > > > to
> > > > > > > > > > contribute.
> > > > > > > > > >
> > > > > > > > > > Best, Fabian
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> > > > > > katherinmail@gmail.com
> > > > > > > >:
> > > > > > > > > >
> > > > > > > > > > > Thank you, Timo.
> > > > > > > > > > > I have started the analysis of the topic.
> > > > > > > > > > > And if it is necessary, I will try to perform the
> > > > > > > > > > > review of other pull requests.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Tue, Jan 17, 2017 at 13:09, Timo Walther <twalthr@apache.org>:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Katherin,
> > > > > > > > > > > >
> > > > > > > > > > > > great to hear that you would like to contribute!
> > Welcome!
> > > > > > > > > > > >
> > > > > > > > > > > > I gave you contributor permissions. You can now
> assign
> > > > issues
> > > > > > to
> > > > > > > > > > > > yourself. I assigned FLINK-1750 to you.
> > > > > > > > > > > > Right now there are many open ML pull requests, you
> are
> > > > very
> > > > > > > > welcome
> > > > > > > > > to
> > > > > > > > > > > > review the code of others, too.
> > > > > > > > > > > >
> > > > > > > > > > > > Timo
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> > > > > > > > > > > > > Hello, All!
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm Kate Eri, a Java developer with 6 years of
> > > > > > > > > > > > > enterprise experience; I also have some expertise
> > > > > > > > > > > > > with Scala (half a year).
> > > > > > > > > > > > >
> > > > > > > > > > > > > For the last 2 years I have participated in several
> > > > > > > > > > > > > BigData projects related to Machine Learning (Time
> > > > > > > > > > > > > series analysis, Recommender systems, Social
> > > > > > > > > > > > > networking) and ETL. I have experience with Hadoop,
> > > > > > > > > > > > > Apache Spark and Hive.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I’m fond of the ML topic, and I see that the Flink
> > > > > > > > > > > > > project requires some work in this area; that’s why
> > > > > > > > > > > > > I would like to join Flink and ask you to assign the
> > > > > > > > > > > > > ticket https://issues.apache.org/jira/browse/FLINK-1750
> > > > > > > > > > > > > to me.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: New Flink team member - Kate Eri.

Posted by Katherin Eri <ka...@gmail.com>.
Thank you, Felix, for your point; it is quite interesting.

I will take a look at the code of the provided Flink integration.

1)    Regarding the problems you see with Flink (>>we realized that the
lack of a caching operator and a broadcast issue highly affect the
performance): have you already asked the community about this? If yes,
please provide a reference to the ticket or the subject of the mail thread.

2)    You said that SystemML provides GPU support. I have looked at
SystemML’s source code and would like to ask: why did you decide to
implement your own integration with CUDA? Did you consider ND4J, or do you
maintain your own implementation because ND4J is younger?
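To make the caching concern concrete: in an iterative job without a caching operator, every iteration re-derives its input from the source, so the preparation cost is paid once per iteration instead of once overall. A toy sketch of that effect (pure Python; all names are illustrative, not Flink or SystemML API):

```python
# Toy model of an iterative ML job with and without a cached input.
# "preprocess" stands in for the expensive derivation of the training set
# (parsing, feature extraction, joins).
def run(raw, iterations, cached):
    calls = [0]

    def preprocess(data):
        calls[0] += 1
        return [x * 2.0 for x in data]

    prepared = preprocess(raw) if cached else None
    model = 0.0
    for _ in range(iterations):
        batch = prepared if cached else preprocess(raw)
        model += sum(batch)  # stand-in for one gradient step
    return model, calls[0]

raw = list(range(1000))
model_a, calls_uncached = run(raw, iterations=10, cached=False)
model_b, calls_cached = run(raw, iterations=10, cached=True)
assert model_a == model_b            # same result either way
print(calls_uncached, calls_cached)  # 10 vs 1: the cost of no caching operator
```

The result is identical, but without caching the preprocessing runs once per iteration, which is the overhead Felix observes compared to an engine with an explicit cache.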

Tue, Feb 7, 2017 at 18:35, Felix Neutatz <ne...@googlemail.com>:

> Hi Katherin,
>
> we are also working in a similar direction. We implemented a prototype to
> integrate with SystemML:
> https://github.com/apache/incubator-systemml/pull/119
> SystemML provides many different matrix formats, operations, GPU support
> and a couple of DL algorithms. Unfortunately, we realized that the lack of
> a caching operator and a broadcast issue highly affect the performance
> (e.g. compared to Spark). At the moment I am trying to tackle the broadcast
> issue. But caching is still a problem for us.
>
> Best regards,
> Felix
>
> 2017-02-07 16:22 GMT+01:00 Katherin Eri <ka...@gmail.com>:
>
> > Thank you, Till.
> >
> > 1)      Regarding ND4J, I didn’t know about such an unfortunate and
> > critical restriction -> the lack of sparsity optimizations, and you are
> > right: this issue is still open for them. I saw that Flink uses Breeze,
> > but I thought its usage was due to historical reasons.
> >
> > 2)      Regarding integration with DL4J, I have read the source code of
> > the DL4J/Spark integration; that’s why I have dropped my idea of reusing
> > their word2vec implementation for now. I can perform a deeper
> > investigation of this topic if it is required.
> >
> >
> >
> > So I feel that we have the following picture:
> >
> > 1)      DL integration investigation could be part of Apache Bahir. I can
> > perform further investigation of this topic, but I think we need a
> > separate ticket to track this activity.
> >
> > 2)      GPU support, required for DL, is interesting, but needs something
> > like ND4J.
> >
> > 3)      ND4J couldn’t be incorporated because it doesn’t support sparsity
> > <https://deeplearning4j.org/roadmap.html> [1].
> >
> > Regarding ND4J: is this the single blocker for its incorporation, or are
> > there other known ones?
> >
> >
> > [1] https://deeplearning4j.org/roadmap.html
> >
> > Tue, Feb 7, 2017 at 16:26, Till Rohrmann <tr...@apache.org>:
> >
> > Thanks for initiating this discussion Katherin. I think you're right that
> > in general it does not make sense to reinvent the wheel over and over
> > again. Especially if you only have limited resources at hand. So if we
> > could integrate Flink with some existing library that would be great.
> >
> > In the past, however, we couldn't find a good library which provided
> enough
> > freedom to integrate it with Flink. Especially if you want to have
> > distributed and somewhat high-performance implementations of ML
> algorithms
> > you would have to take Flink's execution model (capabilities as well as
> > limitations) into account. That is mainly the reason why we started
> > implementing some of the algorithms "natively" on Flink.
> >
> > If I remember correctly, then the problem with ND4J was and still is that
> > it does not support sparse matrices which was a requirement from our
> side.
> > As far as I know, it is quite common that you have sparse data structures
> > when dealing with large scale problems. That's why we built our own
> > abstraction which can have different implementations. Currently, the
> > default implementation uses Breeze.
> >
> > I think the support for GPU based operations and the actual resource
> > management are two orthogonal things. The implementation would have to
> work
> > with no GPUs available anyway. If the system detects that GPUs are
> > available, then ideally it would exploit them. Thus, we could add this
> > feature later and maybe integrate it with FLINK-5131 [1].
> >
> > Concerning the integration with DL4J I think that Theo's proposal to do
> it
> > in a separate repository (maybe as part of Apache Bahir) is a good idea.
> > We're currently thinking about outsourcing some of Flink's libraries into
> > sub projects. This could also be an option for the DL4J integration then.
> > In general I think it should be feasible to run DL4J on Flink given that
> it
> > also runs on Spark. Have you already looked at it closer?
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-5131
> >
> > Cheers,
> > Till
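The sparsity trade-off Till describes can be sketched in a few lines: a dense layout stores one slot per dimension, while a sparse layout stores only the non-zero entries, which is what large-scale ML inputs usually look like. A toy illustration (not ND4J or Breeze API; every name here is hypothetical):

```python
# Dense vs sparse storage of a vector that is mostly zeros.
n = 100_000
nonzeros = {10: 1.0, 5_000: 2.5, 99_999: -3.0}  # index -> value

dense = [0.0] * n  # one slot per dimension, regardless of content
for i, v in nonzeros.items():
    dense[i] = v

def sparse_dot(a, b):
    # A sparse dot product touches only stored entries,
    # iterating over the smaller of the two maps.
    small, big = (a, b) if len(a) <= len(b) else (b, a)
    return sum(v * big.get(i, 0.0) for i, v in small.items())

other = {10: 4.0, 42: 7.0}
print(len(dense), len(nonzeros))  # 100000 stored slots vs 3
print(sparse_dot(nonzeros, other))
```

With only 3 non-zero entries, the dense layout wastes five orders of magnitude more storage, and every dense operation scans all 100,000 slots; this is why a matrix abstraction without sparse support is a blocker for large-scale problems.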
> >
> > On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri <ka...@gmail.com>
> > wrote:
> >
> > > Thank you Theodore, for your reply.
> > >
> > > 1)    Regarding GPU, your point is clear and I agree with it; ND4J
> > > looks appropriate. But my current understanding is that we also need to
> > > cover some resource management questions -> when we provide GPU
> > > support, we also need to manage the GPU as a resource. For example,
> > > Mesos already supports GPUs as a resource: Initial support for GPU
> > > resources.
> > > <https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU>
> > > Flink uses Mesos as a cluster manager, which means that this feature of
> > > Mesos could be reused. Memory management questions in Flink regarding
> > > GPUs should also be clarified.
> > >
> > > 2)    Regarding integration with DL4J: what stops us from creating a
> > > ticket and starting the discussion around this topic? Do we need some
> > > user story, or is the community not sure that DL is really helpful? Why
> > > did the discussion with Adam Gibson end with no implementation of any
> > > idea? What concerns do we have?
> > >
> > > Mon, Feb 6, 2017 at 15:01, Theodore Vasiloudis <theodoros.vasiloudis@gmail.com>:
> > >
> > > > Hello all,
> > > >
> > > > This is a point that has come up in the past: Given the multitude of ML
> > > > libraries out there, should we have native implementations in FlinkML
> > or
> > > > try to integrate other libraries instead?
> > > >
> > > > We haven't managed to reach a consensus on this before. My opinion is
> > > that
> > > > there is definitely value in having ML algorithms written natively in
> > > > Flink, both for performance optimization,
> > > > but more importantly for engineering simplicity: we don't want to
> > > > force users to use yet another piece of software to run their ML
> > > > algos (at least for a basic set of algorithms).
> > > >
> > > > We have in the past  discussed integrations with DL4J (particularly
> > ND4J)
> > > > with Adam Gibson, the core developer of the library, but we never got
> > > > around to implementing anything.
> > > >
> > > > Whether it makes sense to have an integration with DL4J as part of
> the
> > > > Flink distribution would be up for discussion. I would suggest making
> > > > it an independent repo to allow for
> > > > faster dev/release cycles, and because it wouldn't be directly
> related
> > to
> > > > the core of Flink so it would add extra reviewing burden to an
> already
> > > > overloaded group of committers.
> > > >
> > > > Natively supporting GPU calculations in Flink would be much better
> > > > achieved through a library like ND4J; the engineering burden would be
> > > > too much otherwise.
> > > >
> > > > Regards,
> > > > Theodore
> > > >
> > > > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <
> katherinmail@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello, guys.
> > > > >
> > > > > Theodore, last week I started the review of the PR:
> > > > > https://github.com/apache/flink/pull/2735 related to *word2Vec for
> > > > Flink*.
> > > > >
> > > > >
> > > > >
> > > > > During this review I have asked myself: why do we need to implement
> > > > > such a very popular algorithm like *word2vec one more time*, when
> > > > > there is already an available Java implementation provided by the
> > > > > deeplearning4j.org <https://deeplearning4j.org/word2vec> library
> > > > > (DL4J -> Apache 2 licence). This library tries to promote itself,
> > > > > there is hype around it in the ML sphere, and it was integrated
> > > > > with Apache Spark to provide scalable deep learning calculations.
> > > > >
> > > > >
> > > > > *That's why I thought: could we also integrate this library with
> > > > > Flink? *
> > > > >
> > > > > 1) Personally I think providing support and deployment of
> > > > > *Deeplearning (DL) algorithms/models in Flink* is a promising and
> > > > > attractive feature, because:
> > > > >
> > > > >     a) during the last two years DL has proved its efficiency, and
> > > > > these algorithms are used in many applications. For example,
> > > > > *Spotify *uses DL-based algorithms for music content extraction:
> > > > > Recommending music on Spotify with deep learning, AUGUST 05, 2014
> > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html>, for their
> > > > > music recommendations. Developers need to scale DL up manually,
> > > > > which causes a lot of work; that's why platforms like Flink should
> > > > > support the deployment of these models.
> > > > >
> > > > >     b) The scope of deep learning use cases is presented here
> > > > > <https://deeplearning4j.org/use_cases>; many of these scenarios
> > > > > could be supported on Flink.
> > > > >
> > > > >
> > > > > 2) But DL raises questions such as:
> > > > >
> > > > >     a) scaling up calculations across machines
> > > > >
> > > > >     b) performing these calculations both on CPU and GPU. GPU is
> > > > > required to train big DL models; otherwise the learning process can
> > > > > converge very slowly.
> > > > >
> > > > >
> > > > > 3) I have checked the DL4J library, which already has rich support
> > > > > for many attractive DL models: Recurrent Networks and LSTMs,
> > > > > Convolutional Networks (CNN), Restricted Boltzmann Machines (RBM)
> > > > > and others. So we won’t need to implement them independently, but
> > > > > only provide the ability to execute these models on a Flink
> > > > > cluster, in quite a similar way to its integration with Apache
> > > > > Spark.
> > > > >
> > > > >
> > > > > Because of all of this I propose:
> > > > >
> > > > > 1)    Create a new ticket in Flink’s JIRA for the integration of
> > > > > Flink with DL4J and decide on which side this integration should be
> > > > > implemented.
> > > > >
> > > > > 2)    Natively support GPU resources in Flink and allow
> > > > > calculations on them, as described in this publication:
> > > > > https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus
> > > > >
> > > > >
> > > > >
> > > > > *Regarding the original issue Implement Word2Vec
> > > > > <https://issues.apache.org/jira/browse/FLINK-2094> in Flink, *I
> > > > > have investigated its implementation in DL4J and the implementation
> > > > > of the DL4J integration with Apache Spark, and got several points:
> > > > >
> > > > > It seems that the idea of building our own implementation of
> > > > > word2vec in Flink is not such a bad solution, because DL4J was
> > > > > forced to reimplement its original word2Vec over Spark. I have
> > > > > checked the integration of DL4J with Spark and found that it is too
> > > > > strongly coupled with the Spark API, so that it is impossible just
> > > > > to take some DL4J API and reuse it; instead we would need to
> > > > > implement an independent integration for Flink.
> > > > >
> > > > > *That’s why we should simply finish the implementation of the
> > > > > current PR **independently **of the DL4J integration.*
> > > > >
> > > > >
> > > > >
> > > > > Could you please provide your opinion regarding my questions and
> > > > > points? What do you think about them?
> > > > >
> > > > >
> > > > >
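For context on what a native word2vec entails: the data-parallel core of the skip-gram variant is just emitting (center, context) pairs within a sliding window, which is why it maps naturally onto a dataflow engine. A minimal sketch of that step only (illustrative; not the code of the PR under review):

```python
# Skip-gram (center, context) pair generation, the embarrassingly parallel
# core of word2vec: each sentence can be processed independently.
def skip_gram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["flink", "streams", "data", "fast"]
pairs = skip_gram_pairs(sentence, window=1)
print(pairs)
# [('flink', 'streams'), ('streams', 'flink'), ('streams', 'data'),
#  ('data', 'streams'), ('data', 'fast'), ('fast', 'data')]
```

The hard part in a distributed setting is not this step but sharing and updating the embedding matrix across workers, which is exactly where the DL4J/Spark coupling discussed above comes in.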
> > > > > Mon, Feb 6, 2017 at 12:51, Katherin Eri <katherinmail@gmail.com>:
> > > > >
> > > > > > Sorry, guys, I need to finish this letter first.
> > > > > >   Full version of it will come shortly.
> > > > > >

Re: New Flink team member - Kate Eri.

Posted by Felix Neutatz <ne...@googlemail.com>.
Hi Katherin,

we are also working in a similar direction. We implemented a prototype to
integrate with SystemML:
https://github.com/apache/incubator-systemml/pull/119
SystemML provides many different matrix formats, operations, GPU support
and a couple of DL algorithms. Unfortunately, we realized that the lack of
a caching operator and a broadcast issue highly affect the performance
(e.g. compared to Spark). At the moment I am trying to tackle the broadcast
issue. But caching is still a problem for us.

Best regards,
Felix
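A rough way to see why the broadcast issue Felix mentions matters: if a model is shipped once per parallel task instead of once per node, network cost grows with the task count rather than the node count. A back-of-the-envelope sketch (all numbers and names are illustrative, not measurements of Flink or Spark):

```python
# Back-of-the-envelope cost of shipping a broadcast model to a cluster.
def bytes_shipped(model_bytes, nodes, tasks_per_node, per_task):
    # per_task=True models naive shipping of one copy to every parallel task;
    # per_task=False models the desired one copy per node.
    copies = nodes * tasks_per_node if per_task else nodes
    return model_bytes * copies

model_bytes = 64 * 1024 * 1024  # a 64 MiB parameter vector
naive = bytes_shipped(model_bytes, nodes=10, tasks_per_node=8, per_task=True)
per_node = bytes_shipped(model_bytes, nodes=10, tasks_per_node=8, per_task=False)
print(naive // per_node)  # the naive scheme ships 8x more data
```

In an iterative algorithm this factor is paid every superstep, which is why a broadcast inefficiency dominates the comparison against an engine that de-duplicates copies per node.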

2017-02-07 16:22 GMT+01:00 Katherin Eri <ka...@gmail.com>:

> Thank you, Till.
>
> 1)      Regarding ND4J, I didn’t know about such a pity and critical
> restriction of it -> lack of sparsity optimizations, and you are right:
> this issue is still actual for them. I saw that Flink uses Breeze, but I
> thought its usage caused by some historical reasons.
>
> 2)      Regarding integration with DL4J, I have read the source code of
> DL4J/Spark integration, that’s why I have declined my idea of reuse of
> their word2vec implementation for now, for example. I can perform deeper
> investigation of this topic, if it required.
>
>
>
> So I feel that we have the following picture:
>
> 1)      DL integration investigation, could be part of Apache Bahir. I can
> perform futher investigation of this topic, but I thik we need some
> separated ticket for this to track this activity.
>
> 2)      GPU support, required for DL is interesting, but requires ND4J for
> example.
>
> 3)      ND4J couldn’t be incorporated because it doesn’t support sparsity
> <https://deeplearning4j.org/roadmap.html> [1].
>
> Regarding ND4J is this the single blocker for incorporation of it or may be
> some others known?
>
>
> [1] https://deeplearning4j.org/roadmap.html
>
> вт, 7 февр. 2017 г. в 16:26, Till Rohrmann <tr...@apache.org>:
>
> Thanks for initiating this discussion Katherin. I think you're right that
> in general it does not make sense to reinvent the wheel over and over
> again. Especially if you only have limited resources at hand. So if we
> could integrate Flink with some existing library that would be great.
>
> In the past, however, we couldn't find a good library which provided enough
> freedom to integrate it with Flink. Especially if you want to have
> distributed and somewhat high-performance implementations of ML algorithms
> you would have to take Flink's execution model (capabilities as well as
> limitations) into account. That is mainly the reason why we started
> implementing some of the algorithms "natively" on Flink.
>
> If I remember correctly, then the problem with ND4J was and still is that
> it does not support sparse matrices which was a requirement from our side.
> As far as I know, it is quite common that you have sparse data structures
> when dealing with large scale problems. That's why we built our own
> abstraction which can have different implementations. Currently, the
> default implementation uses Breeze.
>
> I think the support for GPU based operations and the actual resource
> management are two orthogonal things. The implementation would have to work
> with no GPUs available anyway. If the system detects that GPUs are
> available, then ideally it would exploit them. Thus, we could add this
> feature later and maybe integrate it with FLINK-5131 [1].
>
> Concerning the integration with DL4J I think that Theo's proposal to do it
> in a separate repository (maybe as part of Apache Bahir) is a good idea.
> We're currently thinking about outsourcing some of Flink's libraries into
> sub projects. This could also be an option for the DL4J integration then.
> In general I think it should be feasible to run DL4J on Flink given that it
> also runs on Spark. Have you already looked at it closer?
>
> [1] https://issues.apache.org/jira/browse/FLINK-5131
>
> Cheers,
> Till
>
> On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri <ka...@gmail.com>
> wrote:
>
> > Thank you Theodore, for your reply.
> >
> > 1)    Regarding GPU, your point is clear and I agree with it, ND4J looks
> > appropriate. But, my current understanding is that, we also need to cover
> > some resource management questions -> when we need to provide GPU support
> > we also need to manage it like resource. For example, Mesos has already
> > supported GPU like resource item: Initial support for GPU resources.
> > <https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU>
> > Flink
> > uses Mesos as cluster manager, and this means that this feature of Mesos
> > could be reused. Also memory managing questions in Flink regarding GPU
> > should be clarified.
> >
> > 2)    Regarding integration with DL4J: what stops us to initialize ticket
> > and start the discussion around this topic? We need some user story or
> the
> > community is not sure that DL is really helpful? Why the discussion with
> > Adam
> > Gibson just finished with no implementation of any idea? What concerns do
> > we have?
> >
> > пн, 6 февр. 2017 г. в 15:01, Theodore Vasiloudis <
> > theodoros.vasiloudis@gmail.com>:
> >
> > > Hell all,
> > >
> > > This is point that has come up in the past: Given the multitude of ML
> > > libraries out there, should we have native implementations in FlinkML
> or
> > > try to integrate other libraries instead?
> > >
> > > We haven't managed to reach a consensus on this before. My opinion is
> > that
> > > there is definitely value in having ML algorithms written natively in
> > > Flink, both for performance optimization,
> > > but more importantly for engineering simplicity, we don't want to force
> > > users to use yet another piece of software to run their ML algos (at
> > least
> > > for a basic set of algorithms).
> > >
> > > We have in the past  discussed integrations with DL4J (particularly
> ND4J)
> > > with Adam Gibson, the core developer of the library, but we never got
> > > around to implementing anything.
> > >
> > > Whether it makes sense to have an integration with DL4J as part of the
> > > Flink distribution would be up for discussion. I would suggest to make
> it
> > > an independent repo to allow for
> > > faster dev/release cycles, and because it wouldn't be directly related
> to
> > > the core of Flink so it would add extra reviewing burden to an already
> > > overloaded group of committers.
> > >
> > > Natively supporting GPU calculations in Flink would be much better
> > achieved
> > > through a library like ND4J, the engineering burden would be too much
> > > otherwise.
> > >
> > > Regards,
> > > Theodore
> > >
> > > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <ka...@gmail.com>
> > > wrote:
> > >
> > > > Hello, guys.
> > > >
> > > > Theodore, last week I started the review of the PR:
> > > > https://github.com/apache/flink/pull/2735 related to *word2Vec for
> > > Flink*.
> > > >
> > > >
> > > >
> > > > During this review I have asked myself: why do we need to implement
> > such
> > > a
> > > > very popular algorithm like *word2vec one more time*, when there is
> > > already
> > > > available implementation in java provided by deeplearning4j.org
> > > > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2
> > > licence).
> > > > This library tries to promote itself; there is hype around it in the
> > > > ML sphere, and it was integrated with Apache Spark to provide
> > > > scalable deeplearning calculations.
> > > >
> > > >
> > > > *That's why I thought: could we also integrate this library with
> > > > Flink?*
> > > >
> > > > 1) Personally I think, providing support and deployment of
> > > > *Deeplearning(DL)
> > > > algorithms/models in Flink* is a promising and attractive feature,
> > > > because:
> > > >
> > > >     a) during the last two years DL has proved its efficiency and
> > > > these algorithms are used in many applications. For example *Spotify*
> > > > uses DL based
> > algorithms
> > > > for music content extraction: Recommending music on Spotify with deep
> > > > learning AUGUST 05, 2014
> > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for their
> > music
> > > > recommendations. Developers need to scale up DL manually, which
> > > > causes a lot of work; that's why platforms like Flink should support
> > > > the deployment of these models.
> > > >
> > > >     b) Here is the scope of Deeplearning use cases
> > > > <https://deeplearning4j.org/use_cases>; many of these scenarios
> > > > could be supported on Flink.
> > > >
> > > >
> > > > 2) But DL raises such questions as:
> > > >
> > > >     a) scaling calculations across machines
> > > >
> > > >     b) performing these calculations on both CPU and GPU. GPU is
> > > > required to train big DL models, otherwise the learning process could
> > > > converge very slowly.
> > > >
> > > >
> > > > 3) I have checked the DL4J library, which already has rich support
> > > > for many attractive DL models: Recurrent Networks and LSTMs,
> > > > Convolutional Networks (CNN), Restricted Boltzmann Machines (RBM) and
> > > > others. So we won’t need to implement them independently, but only
> > > > provide the ability to execute these models over a Flink cluster,
> > > > quite similar to the way it was integrated with Apache Spark.
> > > >
> > > >
> > > > Because of all of this I propose:
> > > >
> > > > 1)    To create a new ticket in Flink’s JIRA for the integration of
> > > > Flink with DL4J and decide on which side this integration should be
> > > > implemented.
> > > >
> > > > 2)    Natively support GPU resources in Flink and allow calculations
> > > > over them, as described in this publication
> > > > https://www.oreilly.com/learning/accelerating-spark-
> > workloads-using-gpus
> > > >
> > > >
> > > >
> > > > *Regarding the original issue Implement Word2Vec
> > > > <https://issues.apache.org/jira/browse/FLINK-2094> in Flink,  *I have
> > > > investigated its implementation in DL4J and the DL4J/Apache Spark
> > > > integration, and have several points:
> > > >
> > > > It seems that the idea of building our own implementation of word2vec
> > > > in Flink is not such a bad solution, because DL4J was forced to
> > > > reimplement its original word2Vec over Spark. I have checked the
> > > > integration of DL4J with Spark, and found that it is too strongly
> > > > coupled with the Spark API, so it is impossible to just take some
> > > > DL4J API and reuse it; instead we need to implement an independent
> > > > integration for Flink.
> > > >
> > > > *That’s why we should simply finish the implementation of the
> > > > current PR **independently **of the DL4J integration.*
> > > >
> > > >
> > > >
> > > > Could you please share your opinion regarding these questions and
> > > > points?
> > > >
> > > >
> > > >
> > > > пн, 6 февр. 2017 г. в 12:51, Katherin Eri <ka...@gmail.com>:
> > > >
> > > > > Sorry, guys I need to finish this letter first.
> > > > >   Full version of it will come shortly.
> > > > >
> > > > > пн, 6 февр. 2017 г. в 12:49, Katherin Eri <katherinmail@gmail.com
> >:
> > > > >
> > > > > Hello, guys.
> > > > > Theodore, last week I started the review of the PR:
> > > > > https://github.com/apache/flink/pull/2735 related to *word2Vec for
> > > > Flink*.
> > > > >
> > > > > During this review I have asked myself: why do we need to implement
> > > such
> > > > a
> > > > > very popular algorithm like *word2vec one more time*, when there is
> > > > > already available implementation in java provided by
> > deeplearning4j.org
> > > > > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2
> > > > licence).
> > > > > This library tries to promote itself; there is hype around it in
> > ML
> > > > > sphere, and  it was integrated with Apache Spark, to provide
> scalable
> > > > > deeplearning calculations.
> > > > > That's why I thought: could we also integrate this library with
> > > > > Flink?
> > > > > 1) Personally I think, providing support and deployment of
> > Deeplearning
> > > > > algorithms/models in Flink is promising and attractive feature,
> > > because:
> > > > >     a) during last two years deeplearning proved its efficiency and
> > > this
> > > > > algorithms used in many applications. For example *Spotify *uses DL
> > > based
> > > > > algorithms for music content extraction: Recommending music on
> > Spotify
> > > > > with deep learning AUGUST 05, 2014
> > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for their
> > > music
> > > > > recommendations. Doing this natively scalable is very attractive.
> > > > >
> > > > >
> > > > > I have investigated that implementation of integration DL4J with
> > Apache
> > > > > Spark, and got several points:
> > > > >
> > > > > 1) It seems that the idea of building our own implementation of
> > > > > word2vec is not such a bad solution, because the integration of
> > > > > DL4J with Spark is too strongly coupled with the Spark API, and it
> > > > > will take time on the DL4J side to adapt this integration to Flink.
> > > > > Also, I had expected that we would be able to just call some API,
> > > > > but that is not the case.
> > > > > 2)
> > > > >
> > > > > https://deeplearning4j.org/use_cases
> > > > > https://www.analyticsvidhya.com/blog/2017/01/t-sne-
> > > > implementation-r-python/
> > > > >
> > > > >
> > > > > чт, 19 янв. 2017 г. в 13:29, Till Rohrmann <tr...@apache.org>:
> > > > >
> > > > > Hi Katherin,
> > > > >
> > > > > welcome to the Flink community. Always great to see new people
> > joining
> > > > the
> > > > > community :-)
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
> > > > katherinmail@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > ok, I've got it.
> > > > > > I will take a look at  https://github.com/apache/flink/pull/2735
> .
> > > > > >
> > > > > > вт, 17 янв. 2017 г. в 14:36, Theodore Vasiloudis <
> > > > > > theodoros.vasiloudis@gmail.com>:
> > > > > >
> > > > > > > Hello Katherin,
> > > > > > >
> > > > > > > Welcome to the Flink community!
> > > > > > >
> > > > > > > The ML component definitely needs a lot of work you are
> correct,
> > we
> > > > are
> > > > > > > facing similar problems to CEP, which we'll hopefully resolve
> > with
> > > > the
> > > > > > > restructuring Stephan has mentioned in that thread.
> > > > > > >
> > > > > > > If you'd like to help out with PRs we have many open, one I
> have
> > > > > started
> > > > > > > reviewing but got side-tracked is the Word2Vec one [1].
> > > > > > >
> > > > > > > Best,
> > > > > > > Theodore
> > > > > > >
> > > > > > > [1] https://github.com/apache/flink/pull/2735
> > > > > > >
> > > > > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <
> > fhueske@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Katherin,
> > > > > > > >
> > > > > > > > welcome to the Flink community!
> > > > > > > > Help with reviewing PRs is always very welcome and a great
> way
> > to
> > > > > > > > contribute.
> > > > > > > >
> > > > > > > > Best, Fabian
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> > > > katherinmail@gmail.com
> > > > > >:
> > > > > > > >
> > > > > > > > > Thank you, Timo.
> > > > > > > > > I have started the analysis of the topic.
> > > > > > > > > And if it necessary, I will try to perform the review of
> > other
> > > > > pulls)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > вт, 17 янв. 2017 г. в 13:09, Timo Walther <
> > twalthr@apache.org
> > > >:
> > > > > > > > >
> > > > > > > > > > Hi Katherin,
> > > > > > > > > >
> > > > > > > > > > great to hear that you would like to contribute! Welcome!
> > > > > > > > > >
> > > > > > > > > > I gave you contributor permissions. You can now assign
> > issues
> > > > to
> > > > > > > > > > yourself. I assigned FLINK-1750 to you.
> > > > > > > > > > Right now there are many open ML pull requests, you are
> > very
> > > > > > welcome
> > > > > > > to
> > > > > > > > > > review the code of others, too.
> > > > > > > > > >
> > > > > > > > > > Timo
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> > > > > > > > > > > Hello, All!
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'm Kate Eri, I'm java developer with 6-year enterprise
> > > > > > experience,
> > > > > > > > > also
> > > > > > > > > > I
> > > > > > > > > > > have some expertise with scala (half of the year).
> > > > > > > > > > >
> > > > > > > > > > > Last 2 years I have participated in several BigData
> > > projects
> > > > > that
> > > > > > > > were
> > > > > > > > > > > related to Machine Learning (Time series analysis,
> > > > Recommender
> > > > > > > > systems,
> > > > > > > > > > > Social networking) and ETL. I have experience with
> > Hadoop,
> > > > > Apache
> > > > > > > > Spark
> > > > > > > > > > and
> > > > > > > > > > > Hive.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I’m fond of ML topic, and I see that Flink project
> > requires
> > > > > some
> > > > > > > work
> > > > > > > > > in
> > > > > > > > > > > this area, so that’s why I would like to join Flink and
> > ask
> > > > me
> > > > > to
> > > > > > > > grant
> > > > > > > > > > the
> > > > > > > > > > > assignment of the ticket
> > > > > > > > > > https://issues.apache.org/jira/browse/FLINK-1750
> > > > > > > > > > > to me.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: New Flink team member - Kate Eri.

Posted by Katherin Eri <ka...@gmail.com>.
Thank you, Till.

1)      Regarding ND4J, I didn’t know about such an unfortunate and critical
restriction -> the lack of sparsity optimizations, and you are right: this
issue is still open for them. I saw that Flink uses Breeze, but I thought
its usage was due to historical reasons.

2)      Regarding integration with DL4J, I have read the source code of the
DL4J/Spark integration; that’s why I have dropped the idea of reusing their
word2vec implementation for now. I can perform a deeper investigation of
this topic if required.



So I feel that we have the following picture:

1)      DL integration investigation could be part of Apache Bahir. I can
perform further investigation of this topic, but I think we need a separate
ticket to track this activity.

2)      GPU support, required for DL, is interesting, but requires ND4J, for
example.

3)      ND4J couldn’t be incorporated because it doesn’t support sparsity
<https://deeplearning4j.org/roadmap.html> [1].

Regarding ND4J: is this the single blocker for incorporating it, or are
there other known ones?


[1] https://deeplearning4j.org/roadmap.html
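
As a side note on finishing the native word2vec PR: the first step of any
word2vec implementation is generating skip-gram (center, context) training
pairs from each sentence. A minimal, purely illustrative sketch (plain
Java, not taken from PR #2735):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: skip-gram pair generation, the preprocessing step
// shared by any word2vec implementation (native or DL4J-backed).
public class SkipGramPairs {

    // For each position i, emit pairs (tokens[i], tokens[j]) for every j
    // within `window` positions of i, excluding i itself.
    static List<String[]> pairs(String[] tokens, int window) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(tokens.length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) {
                    out.add(new String[] { tokens[i], tokens[j] });
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] sentence = { "flink", "streams", "data" };
        for (String[] p : pairs(sentence, 1)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

In a Flink job this per-sentence step would naturally live in a flatMap
over the tokenized corpus; the hard part the thread discusses is the
distributed training that follows, not this preprocessing.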

вт, 7 февр. 2017 г. в 16:26, Till Rohrmann <tr...@apache.org>:

Thanks for initiating this discussion Katherin. I think you're right that
in general it does not make sense to reinvent the wheel over and over
again. Especially if you only have limited resources at hand. So if we
could integrate Flink with some existing library that would be great.

In the past, however, we couldn't find a good library which provided enough
freedom to integrate it with Flink. Especially if you want to have
distributed and somewhat high-performance implementations of ML algorithms
you would have to take Flink's execution model (capabilities as well as
limitations) into account. That is mainly the reason why we started
implementing some of the algorithms "natively" on Flink.

If I remember correctly, then the problem with ND4J was and still is that
it does not support sparse matrices which was a requirement from our side.
As far as I know, it is quite common that you have sparse data structures
when dealing with large scale problems. That's why we built our own
abstraction which can have different implementations. Currently, the
default implementation uses Breeze.

I think the support for GPU based operations and the actual resource
management are two orthogonal things. The implementation would have to work
with no GPUs available anyway. If the system detects that GPUs are
available, then ideally it would exploit them. Thus, we could add this
feature later and maybe integrate it with FLINK-5131 [1].

Concerning the integration with DL4J I think that Theo's proposal to do it
in a separate repository (maybe as part of Apache Bahir) is a good idea.
We're currently thinking about outsourcing some of Flink's libraries into
sub projects. This could also be an option for the DL4J integration then.
In general I think it should be feasible to run DL4J on Flink given that it
also runs on Spark. Have you already looked at it closer?

[1] https://issues.apache.org/jira/browse/FLINK-5131

Cheers,
Till

On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri <ka...@gmail.com>
wrote:

> Thank you Theodore, for your reply.
>
> 1)    Regarding GPU, your point is clear and I agree with it, ND4J looks
> appropriate. But my current understanding is that we also need to cover
> some resource management questions -> when we provide GPU support we also
> need to manage it as a resource. For example, Mesos already supports GPU
> as a resource: Initial support for GPU resources.
> <https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU>
> Flink
> uses Mesos as cluster manager, and this means that this feature of Mesos
> could be reused. Also memory managing questions in Flink regarding GPU
> should be clarified.
>
> 2)    Regarding integration with DL4J: what stops us from creating a ticket
> and starting the discussion around this topic? Do we need some user story,
> or is the community not sure that DL is really helpful? Why did the
> discussion with Adam Gibson end with no implementation of any idea? What
> concerns do we have?
>
> пн, 6 февр. 2017 г. в 15:01, Theodore Vasiloudis <
> theodoros.vasiloudis@gmail.com>:
>
> > Hello all,
> >
> > This is a point that has come up in the past: given the multitude of ML
> > libraries out there, should we have native implementations in FlinkML or
> > try to integrate other libraries instead?
> >
> > We haven't managed to reach a consensus on this before. My opinion is
> that
> > there is definitely value in having ML algorithms written natively in
> > Flink, both for performance optimization,
> > but more importantly for engineering simplicity, we don't want to force
> > users to use yet another piece of software to run their ML algos (at
> least
> > for a basic set of algorithms).
> >
> > We have in the past  discussed integrations with DL4J (particularly
ND4J)
> > with Adam Gibson, the core developer of the library, but we never got
> > around to implementing anything.
> >
> > Whether it makes sense to have an integration with DL4J as part of the
> > Flink distribution would be up for discussion. I would suggest to make
it
> > an independent repo to allow for
> > faster dev/release cycles, and because it wouldn't be directly related
to
> > the core of Flink so it would add extra reviewing burden to an already
> > overloaded group of committers.
> >
> > Natively supporting GPU calculations in Flink would be much better
> achieved
> > through a library like ND4J, the engineering burden would be too much
> > otherwise.
> >
> > Regards,
> > Theodore
> >
> > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <ka...@gmail.com>
> > wrote:
> >
> > > Hello, guys.
> > >
> > > Theodore, last week I started the review of the PR:
> > > https://github.com/apache/flink/pull/2735 related to *word2Vec for
> > Flink*.
> > >
> > >
> > >
> > > During this review I have asked myself: why do we need to implement
> such
> > a
> > > very popular algorithm like *word2vec one more time*, when there is
> > already
> > > available implementation in java provided by deeplearning4j.org
> > > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2
> > licence).
> > > This library tries to promote itself; there is hype around it in the
> > > ML sphere, and it was integrated with Apache Spark to provide scalable
> > > deeplearning calculations.
> > >
> > >
> > > *That's why I thought: could we also integrate this library with
> > > Flink?*
> > >
> > > 1) Personally I think, providing support and deployment of
> > > *Deeplearning(DL)
> > > algorithms/models in Flink* is a promising and attractive feature,
> > > because:
> > >
> > >     a) during the last two years DL has proved its efficiency and
> > > these algorithms are used in many applications. For example *Spotify*
> > > uses DL based algorithms
> > > for music content extraction: Recommending music on Spotify with deep
> > > learning AUGUST 05, 2014
> > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for their
> music
> > > recommendations. Developers need to scale up DL manually, which causes
> > > a lot of work; that's why platforms like Flink should support the
> > > deployment of these models.
> > >
> > >     b) Here is the scope of Deeplearning use cases
> > > <https://deeplearning4j.org/use_cases>; many of these scenarios could
> > > be supported on Flink.
> > >
> > >
> > > 2) But DL raises such questions as:
> > >
> > >     a) scaling calculations across machines
> > >
> > >     b) performing these calculations on both CPU and GPU. GPU is
> > > required to train big DL models, otherwise the learning process could
> > > converge very slowly.
> > >
> > >
> > > 3) I have checked the DL4J library, which already has rich support for
> > > many attractive DL models: Recurrent Networks and LSTMs, Convolutional
> > > Networks (CNN), Restricted Boltzmann Machines (RBM) and others. So we
> > > won’t need to implement them independently, but only provide the
> > > ability to execute these models over a Flink cluster, quite similar to
> > > the way it was integrated with Apache Spark.
> > >
> > >
> > > Because of all of this I propose:
> > >
> > > 1)    To create a new ticket in Flink’s JIRA for the integration of
> > > Flink with DL4J and decide on which side this integration should be
> > > implemented.
> > >
> > > 2)    Natively support GPU resources in Flink and allow calculations
> > > over them, as described in this publication
> > > https://www.oreilly.com/learning/accelerating-spark-
> workloads-using-gpus
> > >
> > >
> > >
> > > *Regarding the original issue Implement Word2Vec
> > > <https://issues.apache.org/jira/browse/FLINK-2094> in Flink,  *I have
> > > investigated its implementation in DL4J and the DL4J/Apache Spark
> > > integration, and have several points:
> > >
> > > It seems that the idea of building our own implementation of word2vec
> > > in Flink is not such a bad solution, because DL4J was forced to
> > > reimplement its original word2Vec over Spark. I have checked the
> > > integration of DL4J with Spark, and found that it is too strongly
> > > coupled with the Spark API, so it is impossible to just take some DL4J
> > > API and reuse it; instead we need to implement an independent
> > > integration for Flink.
> > >
> > > *That’s why we should simply finish the implementation of the current
> > > PR **independently **of the DL4J integration.*
> > >
> > >
> > >
> > > Could you please share your opinion regarding these questions and
> > > points?
> > >
> > >
> > >
> > > пн, 6 февр. 2017 г. в 12:51, Katherin Eri <ka...@gmail.com>:
> > >
> > > > Sorry, guys I need to finish this letter first.
> > > >   Full version of it will come shortly.
> > > >
> > > > пн, 6 февр. 2017 г. в 12:49, Katherin Eri <ka...@gmail.com>:
> > > >
> > > > Hello, guys.
> > > > Theodore, last week I started the review of the PR:
> > > > https://github.com/apache/flink/pull/2735 related to *word2Vec for
> > > Flink*.
> > > >
> > > > During this review I have asked myself: why do we need to implement
> > such
> > > a
> > > > very popular algorithm like *word2vec one more time*, when there is
> > > > already available implementation in java provided by
> deeplearning4j.org
> > > > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2
> > > licence).
> > > > This library tries to promote itself; there is hype around it in
> ML
> > > > sphere, and  it was integrated with Apache Spark, to provide
scalable
> > > > deeplearning calculations.
> > > > That's why I thought: could we also integrate this library with
> > > > Flink?
> > > > 1) Personally I think, providing support and deployment of
> Deeplearning
> > > > algorithms/models in Flink is promising and attractive feature,
> > because:
> > > >     a) during last two years deeplearning proved its efficiency and
> > this
> > > > algorithms used in many applications. For example *Spotify *uses DL
> > based
> > > > algorithms for music content extraction: Recommending music on
> Spotify
> > > > with deep learning AUGUST 05, 2014
> > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for their
> > music
> > > > recommendations. Doing this natively scalable is very attractive.
> > > >
> > > >
> > > > I have investigated that implementation of integration DL4J with
> Apache
> > > > Spark, and got several points:
> > > >
> > > > 1) It seems that the idea of building our own implementation of
> > > > word2vec is not such a bad solution, because the integration of DL4J
> > > > with Spark is too strongly coupled with the Spark API, and it will
> > > > take time on the DL4J side to adapt this integration to Flink. Also,
> > > > I had expected that we would be able to just call some API, but that
> > > > is not the case.
> > > > 2)
> > > >
> > > > https://deeplearning4j.org/use_cases
> > > > https://www.analyticsvidhya.com/blog/2017/01/t-sne-
> > > implementation-r-python/
> > > >
> > > >
> > > > чт, 19 янв. 2017 г. в 13:29, Till Rohrmann <tr...@apache.org>:
> > > >
> > > > Hi Katherin,
> > > >
> > > > welcome to the Flink community. Always great to see new people
> joining
> > > the
> > > > community :-)
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
> > > katherinmail@gmail.com>
> > > > wrote:
> > > >
> > > > > ok, I've got it.
> > > > > I will take a look at  https://github.com/apache/flink/pull/2735.
> > > > >
> > > > > вт, 17 янв. 2017 г. в 14:36, Theodore Vasiloudis <
> > > > > theodoros.vasiloudis@gmail.com>:
> > > > >
> > > > > > Hello Katherin,
> > > > > >
> > > > > > Welcome to the Flink community!
> > > > > >
> > > > > > The ML component definitely needs a lot of work you are correct,
> we
> > > are
> > > > > > facing similar problems to CEP, which we'll hopefully resolve
> with
> > > the
> > > > > > restructuring Stephan has mentioned in that thread.
> > > > > >
> > > > > > If you'd like to help out with PRs we have many open, one I have
> > > > started
> > > > > > reviewing but got side-tracked is the Word2Vec one [1].
> > > > > >
> > > > > > Best,
> > > > > > Theodore
> > > > > >
> > > > > > [1] https://github.com/apache/flink/pull/2735
> > > > > >
> > > > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <
> fhueske@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi Katherin,
> > > > > > >
> > > > > > > welcome to the Flink community!
> > > > > > > Help with reviewing PRs is always very welcome and a great way
> to
> > > > > > > contribute.
> > > > > > >
> > > > > > > Best, Fabian
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> > > katherinmail@gmail.com
> > > > >:
> > > > > > >
> > > > > > > > Thank you, Timo.
> > > > > > > > I have started the analysis of the topic.
> > > > > > > > And if it necessary, I will try to perform the review of
> other
> > > > pulls)
> > > > > > > >
> > > > > > > >
> > > > > > > > вт, 17 янв. 2017 г. в 13:09, Timo Walther <
> twalthr@apache.org
> > >:
> > > > > > > >
> > > > > > > > > Hi Katherin,
> > > > > > > > >
> > > > > > > > > great to hear that you would like to contribute! Welcome!
> > > > > > > > >
> > > > > > > > > I gave you contributor permissions. You can now assign
> issues
> > > to
> > > > > > > > > yourself. I assigned FLINK-1750 to you.
> > > > > > > > > Right now there are many open ML pull requests, you are
> very
> > > > > welcome
> > > > > > to
> > > > > > > > > review the code of others, too.
> > > > > > > > >
> > > > > > > > > Timo
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> > > > > > > > > > Hello, All!
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'm Kate Eri, I'm java developer with 6-year enterprise
> > > > > experience,
> > > > > > > > also
> > > > > > > > > I
> > > > > > > > > > have some expertise with scala (half of the year).
> > > > > > > > > >
> > > > > > > > > > Last 2 years I have participated in several BigData
> > projects
> > > > that
> > > > > > > were
> > > > > > > > > > related to Machine Learning (Time series analysis,
> > > Recommender
> > > > > > > systems,
> > > > > > > > > > Social networking) and ETL. I have experience with
> Hadoop,
> > > > Apache
> > > > > > > Spark
> > > > > > > > > and
> > > > > > > > > > Hive.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I’m fond of ML topic, and I see that Flink project
> requires
> > > > some
> > > > > > work
> > > > > > > > in
> > > > > > > > > > this area, so that’s why I would like to join Flink and
> ask
> > > me
> > > > to
> > > > > > > grant
> > > > > > > > > the
> > > > > > > > > > assignment of the ticket
> > > > > > > > > https://issues.apache.org/jira/browse/FLINK-1750
> > > > > > > > > > to me.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
>

Re: New Flink team member - Kate Eri.

Posted by Till Rohrmann <tr...@apache.org>.
Thanks for initiating this discussion Katherin. I think you're right that
in general it does not make sense to reinvent the wheel over and over
again. Especially if you only have limited resources at hand. So if we
could integrate Flink with some existing library that would be great.

In the past, however, we couldn't find a good library which provided enough
freedom to integrate it with Flink. Especially if you want to have
distributed and somewhat high-performance implementations of ML algorithms
you would have to take Flink's execution model (capabilities as well as
limitations) into account. That is mainly the reason why we started
implementing some of the algorithms "natively" on Flink.

If I remember correctly, then the problem with ND4J was and still is that
it does not support sparse matrices which was a requirement from our side.
As far as I know, it is quite common that you have sparse data structures
when dealing with large scale problems. That's why we built our own
abstraction which can have different implementations. Currently, the
default implementation uses Breeze.
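
To make the sparsity point concrete, here is a minimal, self-contained
sketch (plain Java maps, not FlinkML's actual Breeze-backed abstraction) of
why sparse storage matters at scale: a vector of dimension ~10^6 with a
handful of non-zeros stores only those entries, and a dot product touches
only them.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sparse vector: index -> value map; absent indices are implicitly 0.
public class SparseVectorDemo {

    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        // Iterate over the smaller map so cost is O(min(nnz_a, nnz_b)).
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = small == a ? b : a;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            sum += e.getValue() * large.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Double> x = new HashMap<>();
        x.put(3, 1.0);
        x.put(999_999, 2.0);   // dimension ~10^6, only two non-zeros stored
        Map<Integer, Double> y = new HashMap<>();
        y.put(3, 4.0);
        y.put(42, 7.0);
        // Only the shared index 3 contributes to the product.
        System.out.println(dot(x, y));
    }
}
```

A dense representation of the same vectors would allocate two arrays of a
million doubles each; this is exactly the gap that ND4J's missing sparse
support leaves open.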

I think the support for GPU based operations and the actual resource
management are two orthogonal things. The implementation would have to work
with no GPUs available anyway. If the system detects that GPUs are
available, then ideally it would exploit them. Thus, we could add this
feature later and maybe integrate it with FLINK-5131 [1].
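
The "work without GPUs, exploit them when present" idea could be sketched
roughly like this; note that the backend interface and the detection
predicate below are hypothetical placeholders for illustration, not Flink
or ND4J APIs.

```java
// Hedged sketch: the numeric backend is an interface, and a GPU
// implementation would be selected only if the runtime reports one.
interface MathBackend {
    double[] scale(double[] v, double factor);
}

// Portable fallback that must always work.
class CpuBackend implements MathBackend {
    public double[] scale(double[] v, double factor) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] * factor;
        }
        return out;
    }
}

public class BackendSelector {
    // Stand-in for real GPU detection (e.g. a query to the resource
    // manager, as in the Mesos GPU support mentioned in the thread).
    static boolean gpuAvailable() {
        return false;
    }

    static MathBackend select() {
        if (gpuAvailable()) {
            // A GpuBackend (e.g. ND4J-backed) would be returned here;
            // omitted in this sketch.
        }
        return new CpuBackend();
    }

    public static void main(String[] args) {
        MathBackend backend = select();
        double[] r = backend.scale(new double[] { 1.0, 2.0 }, 3.0);
        System.out.println(r[0] + " " + r[1]);
    }
}
```

The point of the shape is exactly the orthogonality argument above: user
code depends only on the interface, so GPU support can be added later
without touching algorithms.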

Concerning the integration with DL4J I think that Theo's proposal to do it
in a separate repository (maybe as part of Apache Bahir) is a good idea.
We're currently thinking about outsourcing some of Flink's libraries into
sub projects. This could also be an option for the DL4J integration then.
In general I think it should be feasible to run DL4J on Flink given that it
also runs on Spark. Have you already looked at it closer?

[1] https://issues.apache.org/jira/browse/FLINK-5131

Cheers,
Till

On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri <ka...@gmail.com>
wrote:

> Thank you Theodore, for your reply.
>
> 1)    Regarding GPU, your point is clear and I agree with it, ND4J looks
> appropriate. But my current understanding is that we also need to cover
> some resource management questions -> when we provide GPU support we also
> need to manage it as a resource. For example, Mesos already supports GPU
> as a resource: Initial support for GPU resources.
> <https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU>
> Flink
> uses Mesos as cluster manager, and this means that this feature of Mesos
> could be reused. Also memory managing questions in Flink regarding GPU
> should be clarified.
>
> 2)    Regarding integration with DL4J: what stops us from creating a ticket
> and starting the discussion around this topic? Do we need some user story,
> or is the community not sure that DL is really helpful? Why did the
> discussion with Adam Gibson end with no implementation of any idea? What
> concerns do we have?
>
> пн, 6 февр. 2017 г. в 15:01, Theodore Vasiloudis <
> theodoros.vasiloudis@gmail.com>:
>
> > Hello all,
> >
> > This is a point that has come up in the past: given the multitude of ML
> > libraries out there, should we have native implementations in FlinkML or
> > try to integrate other libraries instead?
> >
> > We haven't managed to reach a consensus on this before. My opinion is
> that
> > there is definitely value in having ML algorithms written natively in
> > Flink, both for performance optimization,
> > but more importantly for engineering simplicity, we don't want to force
> > users to use yet another piece of software to run their ML algos (at
> least
> > for a basic set of algorithms).
> >
> > We have in the past  discussed integrations with DL4J (particularly ND4J)
> > with Adam Gibson, the core developer of the library, but we never got
> > around to implementing anything.
> >
> > Whether it makes sense to have an integration with DL4J as part of the
> > Flink distribution would be up for discussion. I would suggest to make it
> > an independent repo to allow for
> > faster dev/release cycles, and because it wouldn't be directly related to
> > the core of Flink so it would add extra reviewing burden to an already
> > overloaded group of committers.
> >
> > Natively supporting GPU calculations in Flink would be much better
> achieved
> > through a library like ND4J, the engineering burden would be too much
> > otherwise.
> >
> > Regards,
> > Theodore
> >
> > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <ka...@gmail.com>
> > wrote:
> >
> > > Hello, guys.
> > >
> > > Theodore, last week I started the review of the PR:
> > > https://github.com/apache/flink/pull/2735 related to *word2Vec for
> > Flink*.
> > >
> > >
> > >
> > > During this review I have asked myself: why do we need to implement
> such
> > a
> > > very popular algorithm like *word2vec one more time*, when there is
> > already
> > > available implementation in java provided by deeplearning4j.org
> > > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2
> > licence).
> > > This library tries to promote itself; there is hype around it in the
> > > ML sphere, and it was integrated with Apache Spark to provide scalable
> > > deeplearning calculations.
> > >
> > >
> > > *That's why I thought: could we integrate with this library or not also
> > and
> > > Flink? *
> > >
> > > 1) Personally I think, providing support and deployment of
> > > *Deeplearning(DL)
> > > algorithms/models in Flink* is promising and attractive feature,
> because:
> > >
> > >     a) during last two years DL proved its efficiency and these
> > algorithms
> > > used in many applications. For example *Spotify *uses DL based
> algorithms
> > > for music content extraction: Recommending music on Spotify with deep
> > > learning AUGUST 05, 2014
> > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for their
> music
> > > recommendations. Developers need to scale up DL manually, that causes a
> > lot
> > > of work, so that’s why such platforms like Flink should support these
> > > models deployment.
> > >
> > >     b) Here is presented the scope of Deeplearning usage cases
> > > <https://deeplearning4j.org/use_cases>, so many of this scenarios
> > related
> > > to scenarios, that could be supported on Flink.
> > >
> > >
> > > 2) But DL uncover such questions like:
> > >
> > >     a) scale up calculations over machines
> > >
> > >     b) perform these calculations both over CPU and GPU. GPU is
> required
> > to
> > > train big DL models, otherwise learning process could have very slow
> > > convergence.
> > >
> > >
> > > 3) I have checked this DL4J library, which already has rich support
> of
> > > many attractive DL models like: Recurrent Networks and LSTMs,
> > Convolutional
> > > Networks (CNN), Restricted Boltzmann Machines (RBM) and others. So we
> > won’t
> > > need to implement them independently, but only provide the ability of
> > > execution of this models over Flink cluster, the quite similar way like
> > it
> > > was integrated with Apache Spark.
> > >
> > >
> > > Because of all of this I propose:
> > >
> > > 1)    To create new ticket in Flink’s JIRA for integration of Flink
> with
> > > DL4J and decide on which side this integration should be implemented.
> > >
> > > 2)    Support natively GPU resources in Flink and allow calculations
> over
> > > them, like that is described in this publication
> > > https://www.oreilly.com/learning/accelerating-spark-
> workloads-using-gpus
> > >
> > >
> > >
> > > *Regarding original issue Implement Word2Vec
> > > <https://issues.apache.org/jira/browse/FLINK-2094>in Flink,  *I have
> > > investigated its implementation in DL4J and  that implementation of
> > > integration DL4J with Apache Spark, and got several points:
> > >
> > > It seems that idea of building of our own implementation of word2vec in
> > > Flink not such a bad solution, because: This DL4J was forced to
> > reimplement
> > > its original word2Vec over Spark. I have checked the integration of
> DL4J
> > > with Spark, and found that it is too strongly coupled with Spark API,
> so
> > > that it is impossible just to take some DL4J API and reuse it, instead
> we
> > > need to implement independent integration for Flink.
> > >
> > > *That’s why we simply finish implementation of current PR
> > > **independently **from
> > > integration to DL4J.*
> > >
> > >
> > >
> > > Could you please provide your opinion regarding my questions and
> points,
> > > what do you think about them?
> > >
> > >
> > >
> > > пн, 6 февр. 2017 г. в 12:51, Katherin Eri <ka...@gmail.com>:
> > >
> > > > Sorry, guys I need to finish this letter first.
> > > >   Full version of it will come shortly.
> > > >
> > > > пн, 6 февр. 2017 г. в 12:49, Katherin Eri <ka...@gmail.com>:
> > > >
> > > > Hello, guys.
> > > > Theodore, last week I started the review of the PR:
> > > > https://github.com/apache/flink/pull/2735 related to *word2Vec for
> > > Flink*.
> > > >
> > > > During this review I have asked myself: why do we need to implement
> > such
> > > a
> > > > very popular algorithm like *word2vec one more time*, when there is
> > > > already available implementation in java provided by
> deeplearning4j.org
> > > > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2
> > > licence).
> > > > This library tries to promote itself, there is a hype around it in
> ML
> > > > sphere, and  it was integrated with Apache Spark, to provide scalable
> > > > deeplearning calculations.
> > > > That's why I thought: could we integrate with this library or not
> also
> > > and
> > > > Flink?
> > > > 1) Personally I think, providing support and deployment of
> Deeplearning
> > > > algorithms/models in Flink is promising and attractive feature,
> > because:
> > > >     a) during last two years deeplearning proved its efficiency and
> > this
> > > > algorithms used in many applications. For example *Spotify *uses DL
> > based
> > > > algorithms for music content extraction: Recommending music on
> Spotify
> > > > with deep learning AUGUST 05, 2014
> > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for their
> > music
> > > > recommendations. Doing this natively scalable is very attractive.
> > > >
> > > >
> > > > I have investigated that implementation of integration DL4J with
> Apache
> > > > Spark, and got several points:
> > > >
> > > > 1) It seems that idea of building of our own implementation of
> word2vec
> > > > not such a bad solution, because the integration of DL4J with Spark
> is
> > > too
> > > > strongly coupled with Spark API and it will take time from the side
> of
> > > DL4J
> > > > to adopt this integration to Flink. Also I have expected that we will
> > be
> > > > able to call just some API, it is not such thing.
> > > > 2)
> > > >
> > > > https://deeplearning4j.org/use_cases
> > > > https://www.analyticsvidhya.com/blog/2017/01/t-sne-
> > > implementation-r-python/
> > > >
> > > >
> > > > чт, 19 янв. 2017 г. в 13:29, Till Rohrmann <tr...@apache.org>:
> > > >
> > > > Hi Katherin,
> > > >
> > > > welcome to the Flink community. Always great to see new people
> joining
> > > the
> > > > community :-)
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
> > > katherinmail@gmail.com>
> > > > wrote:
> > > >
> > > > > ok, I've got it.
> > > > > I will take a look at  https://github.com/apache/flink/pull/2735.
> > > > >
> > > > > вт, 17 янв. 2017 г. в 14:36, Theodore Vasiloudis <
> > > > > theodoros.vasiloudis@gmail.com>:
> > > > >
> > > > > > Hello Katherin,
> > > > > >
> > > > > > Welcome to the Flink community!
> > > > > >
> > > > > > The ML component definitely needs a lot of work you are correct,
> we
> > > are
> > > > > > facing similar problems to CEP, which we'll hopefully resolve
> with
> > > the
> > > > > > restructuring Stephan has mentioned in that thread.
> > > > > >
> > > > > > If you'd like to help out with PRs we have many open, one I have
> > > > started
> > > > > > reviewing but got side-tracked is the Word2Vec one [1].
> > > > > >
> > > > > > Best,
> > > > > > Theodore
> > > > > >
> > > > > > [1] https://github.com/apache/flink/pull/2735
> > > > > >
> > > > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <
> fhueske@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi Katherin,
> > > > > > >
> > > > > > > welcome to the Flink community!
> > > > > > > Help with reviewing PRs is always very welcome and a great way
> to
> > > > > > > contribute.
> > > > > > >
> > > > > > > Best, Fabian
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> > > katherinmail@gmail.com
> > > > >:
> > > > > > >
> > > > > > > > Thank you, Timo.
> > > > > > > > I have started the analysis of the topic.
> > > > > > > > And if it necessary, I will try to perform the review of
> other
> > > > pulls)
> > > > > > > >
> > > > > > > >
> > > > > > > > вт, 17 янв. 2017 г. в 13:09, Timo Walther <
> twalthr@apache.org
> > >:
> > > > > > > >
> > > > > > > > > Hi Katherin,
> > > > > > > > >
> > > > > > > > > great to hear that you would like to contribute! Welcome!
> > > > > > > > >
> > > > > > > > > I gave you contributor permissions. You can now assign
> issues
> > > to
> > > > > > > > > yourself. I assigned FLINK-1750 to you.
> > > > > > > > > Right now there are many open ML pull requests, you are
> very
> > > > > welcome
> > > > > > to
> > > > > > > > > review the code of others, too.
> > > > > > > > >
> > > > > > > > > Timo
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> > > > > > > > > > Hello, All!
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'm Kate Eri, I'm java developer with 6-year enterprise
> > > > > experience,
> > > > > > > > also
> > > > > > > > > I
> > > > > > > > > > have some expertise with scala (half of the year).
> > > > > > > > > >
> > > > > > > > > > Last 2 years I have participated in several BigData
> > projects
> > > > that
> > > > > > > were
> > > > > > > > > > related to Machine Learning (Time series analysis,
> > > Recommender
> > > > > > > systems,
> > > > > > > > > > Social networking) and ETL. I have experience with
> Hadoop,
> > > > Apache
> > > > > > > Spark
> > > > > > > > > and
> > > > > > > > > > Hive.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I’m fond of ML topic, and I see that Flink project
> requires
> > > > some
> > > > > > work
> > > > > > > > in
> > > > > > > > > > this area, so that’s why I would like to join Flink and
> ask
> > > me
> > > > to
> > > > > > > grant
> > > > > > > > > the
> > > > > > > > > > assignment of the ticket
> > > > > > > > > https://issues.apache.org/jira/browse/FLINK-1750
> > > > > > > > > > to me.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
>

Re: New Flink team member - Kate Eri.

Posted by Katherin Eri <ka...@gmail.com>.
Thank you, Theodore, for your reply.

1)    Regarding GPU, your point is clear and I agree with it; ND4J looks
appropriate. But my current understanding is that we also need to cover
some resource management questions: if we provide GPU support, we also need
to manage the GPU as a resource. For example, Mesos already supports GPUs
as a resource type: Initial support for GPU resources.
<https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU> Flink
uses Mesos as a cluster manager, which means this Mesos feature could be
reused. Memory management questions in Flink regarding GPUs should also be
clarified.

2)    Regarding integration with DL4J: what stops us from creating a ticket
and starting the discussion around this topic? Do we need a user story, or
is the community not sure that DL is really helpful? Why did the discussion
with Adam Gibson end without any implementation? What concerns do we have?

пн, 6 февр. 2017 г. в 15:01, Theodore Vasiloudis <
theodoros.vasiloudis@gmail.com>:

> Hello all,
>
> This is point that has come up in the past: Given the multitude of ML
> libraries out there, should we have native implementations in FlinkML or
> try to integrate other libraries instead?
>
> We haven't managed to reach a consensus on this before. My opinion is that
> there is definitely value in having ML algorithms written natively in
> Flink, both for performance optimization,
> but more importantly for engineering simplicity, we don't want to force
> users to use yet another piece of software to run their ML algos (at least
> for a basic set of algorithms).
>
> We have in the past  discussed integrations with DL4J (particularly ND4J)
> with Adam Gibson, the core developer of the library, but we never got
> around to implementing anything.
>
> Whether it makes sense to have an integration with DL4J as part of the
> Flink distribution would be up for discussion. I would suggest to make it
> an independent repo to allow for
> faster dev/release cycles, and because it wouldn't be directly related to
> the core of Flink so it would add extra reviewing burden to an already
> overloaded group of committers.
>
> Natively supporting GPU calculations in Flink would be much better achieved
> through a library like ND4J, the engineering burden would be too much
> otherwise.
>
> Regards,
> Theodore
>
> On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <ka...@gmail.com>
> wrote:
>
> > Hello, guys.
> >
> > Theodore, last week I started the review of the PR:
> > https://github.com/apache/flink/pull/2735 related to *word2Vec for
> Flink*.
> >
> >
> >
> > During this review I have asked myself: why do we need to implement such
> a
> > very popular algorithm like *word2vec one more time*, when there is
> already
> > available implementation in java provided by deeplearning4j.org
> > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2
> licence).
> > This library tries to promote itself, there is a hype around it in ML
> > sphere, and it was integrated with Apache Spark, to provide scalable
> > deeplearning calculations.
> >
> >
> > *That's why I thought: could we integrate with this library or not also
> and
> > Flink? *
> >
> > 1) Personally I think, providing support and deployment of
> > *Deeplearning(DL)
> > algorithms/models in Flink* is promising and attractive feature, because:
> >
> >     a) during last two years DL proved its efficiency and these
> algorithms
> > used in many applications. For example *Spotify *uses DL based algorithms
> > for music content extraction: Recommending music on Spotify with deep
> > learning AUGUST 05, 2014
> > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for their music
> > recommendations. Developers need to scale up DL manually, that causes a
> lot
> > of work, so that’s why such platforms like Flink should support these
> > models deployment.
> >
> >     b) Here is presented the scope of Deeplearning usage cases
> > <https://deeplearning4j.org/use_cases>, so many of this scenarios
> related
> > to scenarios, that could be supported on Flink.
> >
> >
> > 2) But DL uncover such questions like:
> >
> >     a) scale up calculations over machines
> >
> >     b) perform these calculations both over CPU and GPU. GPU is required
> to
> > train big DL models, otherwise learning process could have very slow
> > convergence.
> >
> >
> > 3) I have checked this DL4J library, which already has rich support of
> > many attractive DL models like: Recurrent Networks and LSTMs,
> Convolutional
> > Networks (CNN), Restricted Boltzmann Machines (RBM) and others. So we
> won’t
> > need to implement them independently, but only provide the ability of
> > execution of this models over Flink cluster, the quite similar way like
> it
> > was integrated with Apache Spark.
> >
> >
> > Because of all of this I propose:
> >
> > 1)    To create new ticket in Flink’s JIRA for integration of Flink with
> > DL4J and decide on which side this integration should be implemented.
> >
> > 2)    Support natively GPU resources in Flink and allow calculations over
> > them, like that is described in this publication
> > https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus
> >
> >
> >
> > *Regarding original issue Implement Word2Vec
> > <https://issues.apache.org/jira/browse/FLINK-2094>in Flink,  *I have
> > investigated its implementation in DL4J and  that implementation of
> > integration DL4J with Apache Spark, and got several points:
> >
> > It seems that idea of building of our own implementation of word2vec in
> > Flink not such a bad solution, because: This DL4J was forced to
> reimplement
> > its original word2Vec over Spark. I have checked the integration of DL4J
> > with Spark, and found that it is too strongly coupled with Spark API, so
> > that it is impossible just to take some DL4J API and reuse it, instead we
> > need to implement independent integration for Flink.
> >
> > *That’s why we simply finish implementation of current PR
> > **independently **from
> > integration to DL4J.*
> >
> >
> >
> > Could you please provide your opinion regarding my questions and points,
> > what do you think about them?
> >
> >
> >
> > пн, 6 февр. 2017 г. в 12:51, Katherin Eri <ka...@gmail.com>:
> >
> > > Sorry, guys I need to finish this letter first.
> > >   Full version of it will come shortly.
> > >
> > > пн, 6 февр. 2017 г. в 12:49, Katherin Eri <ka...@gmail.com>:
> > >
> > > Hello, guys.
> > > Theodore, last week I started the review of the PR:
> > > https://github.com/apache/flink/pull/2735 related to *word2Vec for
> > Flink*.
> > >
> > > During this review I have asked myself: why do we need to implement
> such
> > a
> > > very popular algorithm like *word2vec one more time*, when there is
> > > already available implementation in java provided by deeplearning4j.org
> > > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2
> > licence).
> > > This library tries to promote itself, there is a hype around it in ML
> > > sphere, and  it was integrated with Apache Spark, to provide scalable
> > > deeplearning calculations.
> > > That's why I thought: could we integrate with this library or not also
> > and
> > > Flink?
> > > 1) Personally I think, providing support and deployment of Deeplearning
> > > algorithms/models in Flink is promising and attractive feature,
> because:
> > >     a) during last two years deeplearning proved its efficiency and
> this
> > > algorithms used in many applications. For example *Spotify *uses DL
> based
> > > algorithms for music content extraction: Recommending music on Spotify
> > > with deep learning AUGUST 05, 2014
> > > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for their
> music
> > > recommendations. Doing this natively scalable is very attractive.
> > >
> > >
> > > I have investigated that implementation of integration DL4J with Apache
> > > Spark, and got several points:
> > >
> > > 1) It seems that idea of building of our own implementation of word2vec
> > > not such a bad solution, because the integration of DL4J with Spark is
> > too
> > > strongly coupled with Spark API and it will take time from the side of
> > DL4J
> > > to adopt this integration to Flink. Also I have expected that we will
> be
> > > able to call just some API, it is not such thing.
> > > 2)
> > >
> > > https://deeplearning4j.org/use_cases
> > > https://www.analyticsvidhya.com/blog/2017/01/t-sne-
> > implementation-r-python/
> > >
> > >
> > > чт, 19 янв. 2017 г. в 13:29, Till Rohrmann <tr...@apache.org>:
> > >
> > > Hi Katherin,
> > >
> > > welcome to the Flink community. Always great to see new people joining
> > the
> > > community :-)
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
> > katherinmail@gmail.com>
> > > wrote:
> > >
> > > > ok, I've got it.
> > > > I will take a look at  https://github.com/apache/flink/pull/2735.
> > > >
> > > > вт, 17 янв. 2017 г. в 14:36, Theodore Vasiloudis <
> > > > theodoros.vasiloudis@gmail.com>:
> > > >
> > > > > Hello Katherin,
> > > > >
> > > > > Welcome to the Flink community!
> > > > >
> > > > > The ML component definitely needs a lot of work you are correct, we
> > are
> > > > > facing similar problems to CEP, which we'll hopefully resolve with
> > the
> > > > > restructuring Stephan has mentioned in that thread.
> > > > >
> > > > > If you'd like to help out with PRs we have many open, one I have
> > > started
> > > > > reviewing but got side-tracked is the Word2Vec one [1].
> > > > >
> > > > > Best,
> > > > > Theodore
> > > > >
> > > > > [1] https://github.com/apache/flink/pull/2735
> > > > >
> > > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <fhueske@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > Hi Katherin,
> > > > > >
> > > > > > welcome to the Flink community!
> > > > > > Help with reviewing PRs is always very welcome and a great way to
> > > > > > contribute.
> > > > > >
> > > > > > Best, Fabian
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> > katherinmail@gmail.com
> > > >:
> > > > > >
> > > > > > > Thank you, Timo.
> > > > > > > I have started the analysis of the topic.
> > > > > > > And if it necessary, I will try to perform the review of other
> > > pulls)
> > > > > > >
> > > > > > >
> > > > > > > вт, 17 янв. 2017 г. в 13:09, Timo Walther <twalthr@apache.org
> >:
> > > > > > >
> > > > > > > > Hi Katherin,
> > > > > > > >
> > > > > > > > great to hear that you would like to contribute! Welcome!
> > > > > > > >
> > > > > > > > I gave you contributor permissions. You can now assign issues
> > to
> > > > > > > > yourself. I assigned FLINK-1750 to you.
> > > > > > > > Right now there are many open ML pull requests, you are very
> > > > welcome
> > > > > to
> > > > > > > > review the code of others, too.
> > > > > > > >
> > > > > > > > Timo
> > > > > > > >
> > > > > > > >
> > > > > > > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> > > > > > > > > Hello, All!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm Kate Eri, I'm java developer with 6-year enterprise
> > > > experience,
> > > > > > > also
> > > > > > > > I
> > > > > > > > > have some expertise with scala (half of the year).
> > > > > > > > >
> > > > > > > > > Last 2 years I have participated in several BigData
> projects
> > > that
> > > > > > were
> > > > > > > > > related to Machine Learning (Time series analysis,
> > Recommender
> > > > > > systems,
> > > > > > > > > Social networking) and ETL. I have experience with Hadoop,
> > > Apache
> > > > > > Spark
> > > > > > > > and
> > > > > > > > > Hive.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I’m fond of ML topic, and I see that Flink project requires
> > > some
> > > > > work
> > > > > > > in
> > > > > > > > > this area, so that’s why I would like to join Flink and ask
> > me
> > > to
> > > > > > grant
> > > > > > > > the
> > > > > > > > > assignment of the ticket
> > > > > > > > https://issues.apache.org/jira/browse/FLINK-1750
> > > > > > > > > to me.
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> >
>

Re: New Flink team member - Kate Eri.

Posted by Theodore Vasiloudis <th...@gmail.com>.
Hello all,

This is a point that has come up in the past: Given the multitude of ML
libraries out there, should we have native implementations in FlinkML or
try to integrate other libraries instead?

We haven't managed to reach a consensus on this before. My opinion is that
there is definitely value in having ML algorithms written natively in
Flink, both for performance optimization and, more importantly, for
engineering simplicity: we don't want to force users to use yet another
piece of software to run their ML algorithms (at least for a basic set of
algorithms).

We have in the past discussed integrations with DL4J (particularly ND4J)
with Adam Gibson, the core developer of the library, but we never got
around to implementing anything.

Whether it makes sense to have an integration with DL4J as part of the
Flink distribution would be up for discussion. I would suggest to make it
an independent repo to allow for
faster dev/release cycles, and because it wouldn't be directly related to
the core of Flink so it would add extra reviewing burden to an already
overloaded group of committers.

Natively supporting GPU calculations in Flink would be much better achieved
through a library like ND4J, the engineering burden would be too much
otherwise.

Regards,
Theodore

On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri <ka...@gmail.com>
wrote:

> Hello, guys.
>
> Theodore, last week I started the review of the PR:
> https://github.com/apache/flink/pull/2735 related to *word2Vec for Flink*.
>
>
>
> During this review I have asked myself: why do we need to implement such a
> very popular algorithm like *word2vec one more time*, when there is already
> available implementation in java provided by deeplearning4j.org
> <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2 licence).
> This library tries to promote itself, there is a hype around it in ML
> sphere, and it was integrated with Apache Spark, to provide scalable
> deeplearning calculations.
>
>
> *That's why I thought: could we integrate with this library or not also and
> Flink? *
>
> 1) Personally I think, providing support and deployment of
> *Deeplearning(DL)
> algorithms/models in Flink* is promising and attractive feature, because:
>
>     a) during last two years DL proved its efficiency and these algorithms
> used in many applications. For example *Spotify *uses DL based algorithms
> for music content extraction: Recommending music on Spotify with deep
> learning AUGUST 05, 2014
> <http://benanne.github.io/2014/08/05/spotify-cnns.html> for their music
> recommendations. Developers need to scale up DL manually, that causes a lot
> of work, so that’s why such platforms like Flink should support these
> models deployment.
>
>     b) Here is presented the scope of Deeplearning usage cases
> <https://deeplearning4j.org/use_cases>, so many of this scenarios related
> to scenarios, that could be supported on Flink.
>
>
> 2) But DL uncover such questions like:
>
>     a) scale up calculations over machines
>
>     b) perform these calculations both over CPU and GPU. GPU is required to
> train big DL models, otherwise learning process could have very slow
> convergence.
>
>
> 3) I have checked this DL4J library, which already has rich support of
> many attractive DL models like: Recurrent Networks and LSTMs, Convolutional
> Networks (CNN), Restricted Boltzmann Machines (RBM) and others. So we won’t
> need to implement them independently, but only provide the ability of
> execution of this models over Flink cluster, the quite similar way like it
> was integrated with Apache Spark.
>
>
> Because of all of this I propose:
>
> 1)    To create new ticket in Flink’s JIRA for integration of Flink with
> DL4J and decide on which side this integration should be implemented.
>
> 2)    Support natively GPU resources in Flink and allow calculations over
> them, like that is described in this publication
> https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus
>
>
>
> *Regarding original issue Implement Word2Vec
> <https://issues.apache.org/jira/browse/FLINK-2094>in Flink,  *I have
> investigated its implementation in DL4J and  that implementation of
> integration DL4J with Apache Spark, and got several points:
>
> It seems that idea of building of our own implementation of word2vec in
> Flink not such a bad solution, because: This DL4J was forced to reimplement
> its original word2Vec over Spark. I have checked the integration of DL4J
> with Spark, and found that it is too strongly coupled with Spark API, so
> that it is impossible just to take some DL4J API and reuse it, instead we
> need to implement independent integration for Flink.
>
> *That’s why we simply finish implementation of current PR
> **independently **from
> integration to DL4J.*
>
>
>
> Could you please provide your opinion regarding my questions and points,
> what do you think about them?
>
>
>
> пн, 6 февр. 2017 г. в 12:51, Katherin Eri <ka...@gmail.com>:
>
> > Sorry, guys I need to finish this letter first.
> >   Full version of it will come shortly.
> >
> > пн, 6 февр. 2017 г. в 12:49, Katherin Eri <ka...@gmail.com>:
> >
> > Hello, guys.
> > Theodore, last week I started the review of the PR:
> > https://github.com/apache/flink/pull/2735 related to *word2Vec for
> Flink*.
> >
> > During this review I have asked myself: why do we need to implement such
> a
> > very popular algorithm like *word2vec one more time*, when there is
> > already available implementation in java provided by deeplearning4j.org
> > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2
> licence).
> > This library tries to promote itself, there is a hype around it in ML
> > sphere, and  it was integrated with Apache Spark, to provide scalable
> > deeplearning calculations.
> > That's why I thought: could we integrate with this library or not also
> and
> > Flink?
> > 1) Personally I think, providing support and deployment of Deeplearning
> > algorithms/models in Flink is promising and attractive feature, because:
> >     a) during last two years deeplearning proved its efficiency and this
> > algorithms used in many applications. For example *Spotify *uses DL based
> > algorithms for music content extraction: Recommending music on Spotify
> > with deep learning AUGUST 05, 2014
> > <http://benanne.github.io/2014/08/05/spotify-cnns.html> for their music
> > recommendations. Doing this natively scalable is very attractive.
> >
> >
> > I have investigated that implementation of integration DL4J with Apache
> > Spark, and got several points:
> >
> > 1) It seems that idea of building of our own implementation of word2vec
> > not such a bad solution, because the integration of DL4J with Spark is
> too
> > strongly coupled with Spark API and it will take time from the side of
> DL4J
> > to adopt this integration to Flink. Also I have expected that we will be
> > able to call just some API, it is not such thing.
> > 2)
> >
> > https://deeplearning4j.org/use_cases
> > https://www.analyticsvidhya.com/blog/2017/01/t-sne-
> implementation-r-python/
> >
> >
> > чт, 19 янв. 2017 г. в 13:29, Till Rohrmann <tr...@apache.org>:
> >
> > Hi Katherin,
> >
> > welcome to the Flink community. Always great to see new people joining
> the
> > community :-)
> >
> > Cheers,
> > Till
> >
> > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
> katherinmail@gmail.com>
> > wrote:
> >
> > > ok, I've got it.
> > > I will take a look at  https://github.com/apache/flink/pull/2735.
> > >
> > > вт, 17 янв. 2017 г. в 14:36, Theodore Vasiloudis <
> > > theodoros.vasiloudis@gmail.com>:
> > >
> > > > Hello Katherin,
> > > >
> > > > Welcome to the Flink community!
> > > >
> > > > The ML component definitely needs a lot of work you are correct, we
> are
> > > > facing similar problems to CEP, which we'll hopefully resolve with
> the
> > > > restructuring Stephan has mentioned in that thread.
> > > >
> > > > If you'd like to help out with PRs we have many open, one I have
> > started
> > > > reviewing but got side-tracked is the Word2Vec one [1].
> > > >
> > > > Best,
> > > > Theodore
> > > >
> > > > [1] https://github.com/apache/flink/pull/2735
> > > >
> > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <fh...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Katherin,
> > > > >
> > > > > welcome to the Flink community!
> > > > > Help with reviewing PRs is always very welcome and a great way to
> > > > > contribute.
> > > > >
> > > > > Best, Fabian
> > > > >
> > > > >
> > > > >
> > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> katherinmail@gmail.com
> > >:
> > > > >
> > > > > > Thank you, Timo.
> > > > > > I have started the analysis of the topic.
> > > > > > And if it necessary, I will try to perform the review of other
> > pulls)
> > > > > >
> > > > > >
> > > > > > вт, 17 янв. 2017 г. в 13:09, Timo Walther <tw...@apache.org>:
> > > > > >
> > > > > > > Hi Katherin,
> > > > > > >
> > > > > > > great to hear that you would like to contribute! Welcome!
> > > > > > >
> > > > > > > I gave you contributor permissions. You can now assign issues
> to
> > > > > > > yourself. I assigned FLINK-1750 to you.
> > > > > > > Right now there are many open ML pull requests, you are very
> > > welcome
> > > > to
> > > > > > > review the code of others, too.
> > > > > > >
> > > > > > > Timo
> > > > > > >
> > > > > > >
> > > > > > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> > > > > > > > Hello, All!
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > I'm Kate Eri, I'm java developer with 6-year enterprise
> > > experience,
> > > > > > also
> > > > > > > I
> > > > > > > > have some expertise with scala (half of the year).
> > > > > > > >
> > > > > > > > Last 2 years I have participated in several BigData projects
> > that
> > > > > were
> > > > > > > > related to Machine Learning (Time series analysis,
> Recommender
> > > > > systems,
> > > > > > > > Social networking) and ETL. I have experience with Hadoop,
> > Apache
> > > > > Spark
> > > > > > > and
> > > > > > > > Hive.
> > > > > > > >
> > > > > > > >
> > > > > > > > I’m fond of ML topic, and I see that Flink project requires
> > some
> > > > work
> > > > > > in
> > > > > > > > this area, so that’s why I would like to join Flink and ask
> me
> > to
> > > > > grant
> > > > > > > the
> > > > > > > > assignment of the ticket
> > > > > > > https://issues.apache.org/jira/browse/FLINK-1750
> > > > > > > > to me.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
>

Re: New Flink team member - Kate Eri.

Posted by Katherin Eri <ka...@gmail.com>.
Hello, guys.

Theodore, last week I started the review of the PR:
https://github.com/apache/flink/pull/2735 related to *word2Vec for Flink*.



During this review I asked myself: why should we implement such a
popular algorithm as *word2vec* one more time, when a Java
implementation is already available in the deeplearning4j.org
<https://deeplearning4j.org/word2vec> library (DL4J, Apache 2 license)?
This library actively promotes itself, there is hype around it in the ML
community, and it has already been integrated with Apache Spark to
provide scalable deep learning computations.


*That's why I thought: could Flink integrate with this library as well?*

1) Personally I think that supporting the deployment of *deep learning
(DL) algorithms/models in Flink* is a promising and attractive feature,
because:

    a) over the last two years DL has proven its effectiveness, and
these algorithms are used in many applications. For example, *Spotify*
uses DL-based algorithms for music content extraction: Recommending
music on Spotify with deep learning, AUGUST 05, 2014
<http://benanne.github.io/2014/08/05/spotify-cnns.html>. Today
developers have to scale DL up manually, which causes a lot of work, so
platforms like Flink should support the deployment of these models.

    b) The scope of deep learning use cases is presented here
<https://deeplearning4j.org/use_cases>; many of these scenarios could be
supported on Flink.


2) But DL raises open questions, such as:

    a) scaling the computations across machines;

    b) running these computations on both CPUs and GPUs. GPUs are
required to train large DL models; otherwise the learning process can
converge very slowly.


3) I have checked the DL4J library: it already has rich support for many
attractive DL models, such as recurrent networks and LSTMs,
convolutional networks (CNNs), restricted Boltzmann machines (RBMs) and
others. So we would not need to implement them ourselves, only provide
the ability to execute these models on a Flink cluster, in much the same
way as DL4J was integrated with Apache Spark.
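To make "executing these models on a Flink cluster" concrete: for
inference, it mostly means broadcasting the trained model to every
worker and applying it in a map over the partitioned data (in Flink,
roughly a map over a DataSet with the model attached as a broadcast
set). A minimal sketch in plain Python, where lists simulate the
partitions and `Model`/`predict` are made-up toy stand-ins, not DL4J
API:

```python
class Model:
    """Toy stand-in for a trained model shipped to every worker."""

    def __init__(self, weights):
        self.weights = weights

    def predict(self, x):
        # Toy scoring function: dot product followed by a threshold.
        score = sum(xi * wi for xi, wi in zip(x, self.weights))
        return 1 if score >= 0.0 else 0


def score_partitions(partitions, model):
    """Every partition receives the same model and maps over its elements."""
    return [[model.predict(x) for x in part] for part in partitions]
```

The point of the pattern is that only the (small) model is shipped
around, while the (large) data stays partitioned on the workers.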


Because of all this I propose:

1)    To create a new ticket in Flink's JIRA for the integration of
Flink with DL4J, and to decide on which side this integration should be
implemented.

2)    To natively support GPU resources in Flink and allow computations
on them, as described in this publication:
https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus



*Regarding the original issue Implement Word2Vec
<https://issues.apache.org/jira/browse/FLINK-2094> in Flink,* I have
investigated its implementation in DL4J and the integration of DL4J with
Apache Spark, and arrived at several points:

It seems that building our own implementation of word2vec in Flink is
not such a bad solution, because DL4J itself was forced to reimplement
its original word2Vec on top of Spark. I have checked DL4J's integration
with Spark and found that it is too strongly coupled with the Spark API:
it is impossible to simply take some DL4J API and reuse it; instead we
would need to implement an independent integration for Flink.

*That's why I suggest we simply finish the implementation of the current
PR **independently** of any DL4J integration.*
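For reference, the core algorithm in question (skip-gram word2vec with
negative sampling) is compact enough to sketch in plain Python. This is
purely illustrative, not the PR's code: all names and hyperparameters
below are made up for the example, and a real Flink version would
partition the corpus and share the vectors across workers.

```python
import math
import random


def _sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def train_word2vec(corpus, dim=16, window=2, negatives=3, lr=0.05,
                   epochs=20, seed=42):
    """Skip-gram with negative sampling over a tokenized toy corpus."""
    rnd = random.Random(seed)
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    # Input ("word") and output ("context") vector tables.
    w_in = [[rnd.uniform(-0.05, 0.05) for _ in range(dim)] for _ in vocab]
    w_out = [[0.0] * dim for _ in vocab]

    for _ in range(epochs):
        for pos, word in enumerate(corpus):
            wi = idx[word]
            lo, hi = max(0, pos - window), min(len(corpus) - 1, pos + window)
            for p in range(lo, hi + 1):
                if p == pos:
                    continue
                # One positive context word plus a few random negatives.
                targets = [(idx[corpus[p]], 1.0)] + [
                    (rnd.randrange(len(vocab)), 0.0) for _ in range(negatives)
                ]
                grad = [0.0] * dim
                for ci, label in targets:
                    dot = sum(a * b for a, b in zip(w_in[wi], w_out[ci]))
                    g = lr * (label - _sigmoid(dot))
                    for d in range(dim):
                        grad[d] += g * w_out[ci][d]
                        w_out[ci][d] += g * w_in[wi][d]
                for d in range(dim):
                    w_in[wi][d] += grad[d]
    return {v: w_in[idx[v]] for v in vocab}


def cosine(a, b):
    """Cosine similarity between two word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```

Even this toy version shows why a distributed implementation is
non-trivial: the word and context vector tables are shared mutable
state, which is exactly the part each framework (Spark, Flink) has to
solve in its own way.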



Could you please share your opinion on these questions and points? What
do you think about them?



пн, 6 февр. 2017 г. в 12:51, Katherin Eri <ka...@gmail.com>:

> Sorry, guys I need to finish this letter first.
>   Full version of it will come shortly.
>
> пн, 6 февр. 2017 г. в 12:49, Katherin Eri <ka...@gmail.com>:
>
> Hello, guys.
> Theodore, last week I started the review of the PR:
> https://github.com/apache/flink/pull/2735 related to *word2Vec for Flink*.
>
> During this review I have asked myself: why do we need to implement such a
> very popular algorithm like *word2vec one more time*, when there is
> already availabe implementation in java provided by deeplearning4j.org
> <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2 licence).
> This library tries to promote it self, there is a hype around it in ML
> sphere, and  it was integrated with Apache Spark, to provide scalable
> deeplearning calculations.
> That's why I thought: could we integrate with this library or not also and
> Flink?
> 1) Personally I think, providing support and deployment of Deeplearning
> algorithms/models in Flink is promising and attractive feature, because:
>     a) during last two years deeplearning proved its efficiency and this
> algorithms used in many applications. For example *Spotify *uses DL based
> algorithms for music content extraction: Recommending music on Spotify
> with deep learning AUGUST 05, 2014
> <http://benanne.github.io/2014/08/05/spotify-cnns.html> for their music
> recommendations. Doing this natively scalable is very attractive.
>
>
> I have investigated that implementation of integration DL4J with Apache
> Spark, and got several points:
>
> 1) It seems that idea of building of our own implementation of word2vec
> not such a bad solution, because the integration of DL4J with Spark is too
> strongly coupled with Saprk API and it will take time from the side of DL4J
> to adopt this integration to Flink. Also I have expected that we will be
> able to call just some API, it is not such thing.
> 2)
>
> https://deeplearning4j.org/use_cases
> https://www.analyticsvidhya.com/blog/2017/01/t-sne-implementation-r-python/
>
>
> чт, 19 янв. 2017 г. в 13:29, Till Rohrmann <tr...@apache.org>:
>
> Hi Katherin,
>
> welcome to the Flink community. Always great to see new people joining the
> community :-)
>
> Cheers,
> Till
>
> On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <ka...@gmail.com>
> wrote:
>
> > ok, I've got it.
> > I will take a look at  https://github.com/apache/flink/pull/2735.
> >
> > вт, 17 янв. 2017 г. в 14:36, Theodore Vasiloudis <
> > theodoros.vasiloudis@gmail.com>:
> >
> > > Hello Katherin,
> > >
> > > Welcome to the Flink community!
> > >
> > > The ML component definitely needs a lot of work you are correct, we are
> > > facing similar problems to CEP, which we'll hopefully resolve with the
> > > restructuring Stephan has mentioned in that thread.
> > >
> > > If you'd like to help out with PRs we have many open, one I have
> started
> > > reviewing but got side-tracked is the Word2Vec one [1].
> > >
> > > Best,
> > > Theodore
> > >
> > > [1] https://github.com/apache/flink/pull/2735
> > >
> > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <fh...@gmail.com>
> > wrote:
> > >
> > > > Hi Katherin,
> > > >
> > > > welcome to the Flink community!
> > > > Help with reviewing PRs is always very welcome and a great way to
> > > > contribute.
> > > >
> > > > Best, Fabian
> > > >
> > > >
> > > >
> > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <katherinmail@gmail.com
> >:
> > > >
> > > > > Thank you, Timo.
> > > > > I have started the analysis of the topic.
> > > > > And if it necessary, I will try to perform the review of other
> pulls)
> > > > >
> > > > >
> > > > > вт, 17 янв. 2017 г. в 13:09, Timo Walther <tw...@apache.org>:
> > > > >
> > > > > > Hi Katherin,
> > > > > >
> > > > > > great to hear that you would like to contribute! Welcome!
> > > > > >
> > > > > > I gave you contributor permissions. You can now assign issues to
> > > > > > yourself. I assigned FLINK-1750 to you.
> > > > > > Right now there are many open ML pull requests, you are very
> > welcome
> > > to
> > > > > > review the code of others, too.
> > > > > >
> > > > > > Timo
> > > > > >
> > > > > >
> > > > > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> > > > > > > Hello, All!
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I'm Kate Eri, I'm java developer with 6-year enterprise
> > experience,
> > > > > also
> > > > > > I
> > > > > > > have some expertise with scala (half of the year).
> > > > > > >
> > > > > > > Last 2 years I have participated in several BigData projects
> that
> > > > were
> > > > > > > related to Machine Learning (Time series analysis, Recommender
> > > > systems,
> > > > > > > Social networking) and ETL. I have experience with Hadoop,
> Apache
> > > > Spark
> > > > > > and
> > > > > > > Hive.
> > > > > > >
> > > > > > >
> > > > > > > I’m fond of ML topic, and I see that Flink project requires
> some
> > > work
> > > > > in
> > > > > > > this area, so that’s why I would like to join Flink and ask me
> to
> > > > grant
> > > > > > the
> > > > > > > assignment of the ticket
> > > > > > https://issues.apache.org/jira/browse/FLINK-1750
> > > > > > > to me.
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>

Re: New Flink team member - Kate Eri.

Posted by Katherin Eri <ka...@gmail.com>.
Sorry, guys I need to finish this letter first.
  Full version of it will come shortly.

пн, 6 февр. 2017 г. в 12:49, Katherin Eri <ka...@gmail.com>:

> Hello, guys.
> Theodore, last week I started the review of the PR:
> https://github.com/apache/flink/pull/2735 related to *word2Vec for Flink*.
>
> During this review I have asked myself: why do we need to implement such a
> very popular algorithm like *word2vec one more time*, when there is
> already availabe implementation in java provided by deeplearning4j.org
> <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2 licence).
> This library tries to promote it self, there is a hype around it in ML
> sphere, and  it was integrated with Apache Spark, to provide scalable
> deeplearning calculations.
> That's why I thought: could we integrate with this library or not also and
> Flink?
> 1) Personally I think, providing support and deployment of Deeplearning
> algorithms/models in Flink is promising and attractive feature, because:
>     a) during last two years deeplearning proved its efficiency and this
> algorithms used in many applications. For example *Spotify *uses DL based
> algorithms for music content extraction: Recommending music on Spotify
> with deep learning AUGUST 05, 2014
> <http://benanne.github.io/2014/08/05/spotify-cnns.html> for their music
> recommendations. Doing this natively scalable is very attractive.
>
>
> I have investigated that implementation of integration DL4J with Apache
> Spark, and got several points:
>
> 1) It seems that idea of building of our own implementation of word2vec
> not such a bad solution, because the integration of DL4J with Spark is too
> strongly coupled with Saprk API and it will take time from the side of DL4J
> to adopt this integration to Flink. Also I have expected that we will be
> able to call just some API, it is not such thing.
> 2)
>
> https://deeplearning4j.org/use_cases
> https://www.analyticsvidhya.com/blog/2017/01/t-sne-implementation-r-python/
>
>
> чт, 19 янв. 2017 г. в 13:29, Till Rohrmann <tr...@apache.org>:
>
> Hi Katherin,
>
> welcome to the Flink community. Always great to see new people joining the
> community :-)
>
> Cheers,
> Till
>
> On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <ka...@gmail.com>
> wrote:
>
> > ok, I've got it.
> > I will take a look at  https://github.com/apache/flink/pull/2735.
> >
> > вт, 17 янв. 2017 г. в 14:36, Theodore Vasiloudis <
> > theodoros.vasiloudis@gmail.com>:
> >
> > > Hello Katherin,
> > >
> > > Welcome to the Flink community!
> > >
> > > The ML component definitely needs a lot of work you are correct, we are
> > > facing similar problems to CEP, which we'll hopefully resolve with the
> > > restructuring Stephan has mentioned in that thread.
> > >
> > > If you'd like to help out with PRs we have many open, one I have
> started
> > > reviewing but got side-tracked is the Word2Vec one [1].
> > >
> > > Best,
> > > Theodore
> > >
> > > [1] https://github.com/apache/flink/pull/2735
> > >
> > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <fh...@gmail.com>
> > wrote:
> > >
> > > > Hi Katherin,
> > > >
> > > > welcome to the Flink community!
> > > > Help with reviewing PRs is always very welcome and a great way to
> > > > contribute.
> > > >
> > > > Best, Fabian
> > > >
> > > >
> > > >
> > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <katherinmail@gmail.com
> >:
> > > >
> > > > > Thank you, Timo.
> > > > > I have started the analysis of the topic.
> > > > > And if it necessary, I will try to perform the review of other
> pulls)
> > > > >
> > > > >
> > > > > вт, 17 янв. 2017 г. в 13:09, Timo Walther <tw...@apache.org>:
> > > > >
> > > > > > Hi Katherin,
> > > > > >
> > > > > > great to hear that you would like to contribute! Welcome!
> > > > > >
> > > > > > I gave you contributor permissions. You can now assign issues to
> > > > > > yourself. I assigned FLINK-1750 to you.
> > > > > > Right now there are many open ML pull requests, you are very
> > welcome
> > > to
> > > > > > review the code of others, too.
> > > > > >
> > > > > > Timo
> > > > > >
> > > > > >
> > > > > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> > > > > > > Hello, All!
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I'm Kate Eri, I'm java developer with 6-year enterprise
> > experience,
> > > > > also
> > > > > > I
> > > > > > > have some expertise with scala (half of the year).
> > > > > > >
> > > > > > > Last 2 years I have participated in several BigData projects
> that
> > > > were
> > > > > > > related to Machine Learning (Time series analysis, Recommender
> > > > systems,
> > > > > > > Social networking) and ETL. I have experience with Hadoop,
> Apache
> > > > Spark
> > > > > > and
> > > > > > > Hive.
> > > > > > >
> > > > > > >
> > > > > > > I’m fond of ML topic, and I see that Flink project requires
> some
> > > work
> > > > > in
> > > > > > > this area, so that’s why I would like to join Flink and ask me
> to
> > > > grant
> > > > > > the
> > > > > > > assignment of the ticket
> > > > > > https://issues.apache.org/jira/browse/FLINK-1750
> > > > > > > to me.
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>