Posted to dev@flink.apache.org by Vasiliki Kalavri <va...@gmail.com> on 2017/02/24 15:39:19 UTC

[DISCUSS] Gelly planning for release 1.3 and roadmap

Hello squirrels,

this is a discussion thread to organize the Gelly component development for
release 1.3 and discuss longer-term plans for the library.

I am hoping that with time-based releases, we can distribute the load for
PR reviewing and make better use of our time, and also point contributors
to "useful" tickets when they offer to help.

I'm expecting the outcome of this discussion to be:

(1) a set of open PRs to review and try merging for 1.3
(2) a set of open JIRAs to work on before feature freeze
(3) a set of JIRAs and PRs to reorganize/close
(4) ideas on possible FLIPs

Here's my initial take on things, i.e. features *I* see as important in the
short-term. Feel free to add/remove/discuss:

Release 1.3
==========
- Bipartite graph support. Initial support has been added, but there are
unreviewed PRs
<https://github.com/apache/flink/pulls?utf8=%E2%9C%93&q=is%3Apr%20is%3Aopen%20bipartite%20>
and there is no Scala API yet. It would be nice to organize this feature,
decide what functionality we need and what functionality is already covered
by the Graph type and have proper bipartite support for 1.3.
- Driver improvements, i.e. #3294
<https://github.com/apache/flink/pull/3294>
- Algorithm improvements, #2733 <https://github.com/apache/flink/pull/2733>
- Affinity Propagation algorithm. This one has been developed using a bulk
iteration plan and needs a review. The PR is #2885
<https://github.com/apache/flink/pull/2885>.
- Object reuse issues, FLINK-5890, FLINK-5891
- Vertex-centric iteration improvement, i.e. FLINK-5127


Roadmap
========
Regarding longer-term plans, I see the following issues as still being
relevant from the existing roadmap [1]:
- Extending the iteration functionality to support algorithms more complex
than value propagation, e.g. with nested loops
- Partitioning methods
- Partition-centric iterations
- Performance evaluation

These two lists are by no means complete or final and the goal of this
thread is to see what the community is interested in, whether these
features / additions make sense to be worked on, or what features are
missing.
So, please provide your feedback!

Cheers,
-V.

[1]: https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly

Re: [DISCUSS] Gelly planning for release 1.3 and roadmap

Posted by Greg Hogan <co...@greghogan.com>.
On Fri, Feb 24, 2017 at 1:43 PM, Vasiliki Kalavri <vasilikikalavri@gmail.com> wrote:
Hi Greg,

On 24 February 2017 at 18:09, Greg Hogan <code@greghogan.com> wrote:

> Thanks, Vasia, for starting the discussion.
>
> I was expecting more changes from the recent discussion on restructuring
> the project, in particular regarding the libraries. Gelly has always
> collected algorithms and I have personally taken an algorithms-first
> approach for contributions. Is that manageable and maintainable? I'd prefer
> to see no limit to good contributions, and if necessary split the codebase
> or the project.

I don't think there should be a limit either. I do think though that
development should be community-driven, i.e. not making contributions just
for the sake of it, but evaluating their benefit first.
The library already has quite a long list of algorithms. Shall we keep on
extending it? And if yes, how do we choose which algorithms to add? Do we
accept any algorithm even if nobody has asked for it? So far, we've
added algorithms that we thought were useful and common. But continuing to
extend the library like this doesn't seem maintainable to me, because we
might end up with a lot of code to maintain that nobody uses. On the other
hand, adding more algorithms might attract more users, so I see a trade-off
there.

I only count 10 algorithms (with a few new in review). I’m not envisioning users thinking of their algorithm then scouring the web looking for a great implementation. I think Gelly reaches a larger audience as a diverse collection of well-implemented large scale algorithms on a distributed streaming processor. We do want to provide algorithms for which Flink is well-suited (parallel algorithms; large, sparse datasets; variable length results).

Flink has a stable API so maintenance should be minimized.

AdamicAdar
ClusteringCoefficient
CommunityDetection
ConnectedComponents
HITS
JaccardIndex
LabelPropagation
PageRank
SSSP
Summarization
 
> If so, then a secondary goal is to make the algorithms user-accessible and
> easier to review (especially at scale!). FLINK-4949 rewrites
> flink-gelly-examples with modular inputs and algorithms, allows users to
> run all existing algorithms, and makes it trivial to create a driver for
> new algorithms (and when comparing different implementations).

I'm +1 for anything that makes using existing functionality easier.
FLINK-4949 sounds like a great addition. Could you maybe extend the JIRA
and/or PR description a bit? I understand the rationale but it would be
nice to have a high-level description of the changes and the new
functionality that the PR adds or the interfaces it modifies. Otherwise, it
will be difficult to review a PR with +5k line changes :)

I’ve broken this into sub-tasks. I was anticipating a chop-and-rebase review.
 
> Regarding BipartiteGraphs, without algorithms or ideas for algorithms it's
> not possible to review the structure of the open pull requests.

I'm not sure I understand this point. There was a design document and an
extensive discussion on this issue. Do you think we should revisit? Some
common algorithms for bipartite graphs that I am aware of are SALSA for
recommendations and relevance search for anomaly detection.

Gelly supports both directed and undirected graphs using a single `Graph` class. Do these new bipartite algorithms require a new BipartiteGraph class or can they reuse just an Edge set? I’ve added some comments to the open PRs that would reduce the amount of code, and my concerns are only maintainability and interoperability.
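
To make the question concrete, here is a minimal sketch in plain Python (not the Gelly API) of one common bipartite operation, the top projection, computed from nothing more than a set of (top, bottom) edge pairs. The function name `top_projection` and the sample vertex ids are assumptions made for illustration; the open PRs may structure this differently.

```python
from collections import defaultdict
from itertools import combinations


def top_projection(edges):
    """Connect two top vertices when they share at least one bottom vertex.

    `edges` is an iterable of (top_id, bottom_id) pairs; no dedicated
    bipartite graph type is needed for this operation.
    """
    # Group top vertices by the bottom vertex they are attached to.
    tops_by_bottom = defaultdict(set)
    for top, bottom in edges:
        tops_by_bottom[bottom].add(top)

    # Every pair of top vertices sharing a bottom vertex becomes an edge.
    projected = set()
    for tops in tops_by_bottom.values():
        for a, b in combinations(sorted(tops), 2):
            projected.add((a, b))
    return sorted(projected)


# Hypothetical user-item edges.
edges = [("u1", "i1"), ("u2", "i1"), ("u2", "i2"), ("u3", "i2")]
projection = top_projection(edges)
```

In this sketch the edge set alone is enough; a separate class would mainly add type safety for keeping the two vertex sets apart.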
 
> +1 to evaluating performance and promoting Flink!
>
> Gelly has two shepherds whereas CEP and ML share one committer. New
> algorithms in Gelly require new features in the Batch API (Gelly may also
> start doing streaming, we're cool kids, too)

^^

> so we need to find a process
> for snuffing ideas early and for the right balance in dependence on core
> committers' time. For example, reworking the iteration scheduler to allow
> for intermediate outputs and nested iterations. Can this feature be
> developed and reviewed within Gelly? Does it need the blessing of a Stephan
> or Fabian? I'd like to see contributors and committers less dependent on
> the core team and more autonomous.
>

What do you mean by developed and reviewed "within Gelly"?
This feature would require changes in the batch iterations code and will
probably need to be proposed and reviewed as a FLIP, so it would need the
blessing of the community :)

Having help from someone who is more familiar with this part of the code is
of course preferable, but I don't think it's absolutely necessary.

I don’t know that this under-the-hood improvement would necessitate a FLIP but I also do not yet understand the changes required. My thought was only that if this is a very important feature for Gelly, we could plan, develop, and do the initial reviews among the Gelly contributors. Another example would be FLINK-3695 which I initially envisioned adding ValueArray types to flink-core but now think would be better to integrate first into Gelly.

Re: [DISCUSS] Gelly planning for release 1.3 and roadmap

Posted by Vasiliki Kalavri <va...@gmail.com>.
Hi Xingcan,

thank you for your input!

On 27 February 2017 at 14:03, Xingcan Cui <xi...@gmail.com> wrote:

> Hi Vasia and Greg,
>
> thanks for the discussion. I'd like to share my thoughts.
>
> 1) I don't think it's necessary to extend the algorithm list intentionally.
> It's just like a textbook that cannot cover all the existing algorithms
> (even if we can). Just representative and commonly used ones will be
> enough. After all, Gelly is mainly designed for providing a framework
> rather than an algorithm library. Besides, it seems that Gelly's API is not
> stable now and thus a huge amount of refactoring or even rewriting will
> arise once the API changes.
>

In fact, Gelly's main APIs haven't changed much throughout the last 3
releases.
Maybe it's time we explicitly mark stable APIs for 1.3.



>
> 2) Unlike other "pure" graph computing frameworks (e.g. Giraph), Gelly is
> built on top of Flink, which means that it can only use operations that
> are provided by it. In my own opinion, Flink's batch processing is not as
> outstanding as its stream processing. As Greg said, one problem lies in
> intermediate results caching. Though it's not clear to me (I'm still an
> ignorant newcomer...) why this feature has not been implemented for such a
> long time, there must be some reasons. What I see is that, to some extent,
> it has already obstructed Gelly's development. From this point of view,
> self-blessing is better than blessing from others and I'm sure some MLers
> may be more anxious than us :) So, I guess "within Gelly" just means a
> Gelly-driven development?
>
> In a nutshell, I would encourage more concentration on Gelly's API (or even
> related Flink APIs if necessary), rather than high-level things (e.g.
> algorithms, performance) on top of it. What if we can change both the
> edges' values and vertices' values during an iteration one day? :)
>

Changing both edge and vertex values during an iteration has also been
brought up before.
This one could be implemented by providing an alternative representation of
the graph (e.g. adjacency list)
and would (hopefully) leave existing iteration APIs unchanged. I'm on board
with adding this to the roadmap.
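
As a rough illustration of the adjacency-list idea, here is a minimal sketch in plain Python (not Gelly or Flink code). Each record carries a vertex value together with its out-edge values, so a single step can rewrite both. The names `AdjacencyVertex` and `superstep`, and the particular update rules, are assumptions chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AdjacencyVertex:
    """One record per vertex: its value plus its out-edges and their values."""
    vertex_id: int
    value: float
    out_edges: Dict[int, float] = field(default_factory=dict)  # target id -> edge value


def superstep(vertices: List[AdjacencyVertex]) -> List[AdjacencyVertex]:
    """Return new records in which vertex and edge values are updated together."""
    # Snapshot of the previous vertex values, read by every update below.
    old_values = {v.vertex_id: v.value for v in vertices}
    updated = []
    for v in vertices:
        # Hypothetical edge rule: absolute difference of the endpoint values.
        new_edges = {t: abs(v.value - old_values[t]) for t in v.out_edges}
        # Hypothetical vertex rule: minimum over own and out-neighbour values.
        new_value = min([v.value] + [old_values[t] for t in v.out_edges])
        updated.append(AdjacencyVertex(v.vertex_id, new_value, new_edges))
    return updated


graph = [
    AdjacencyVertex(1, 5.0, {2: 0.0, 3: 0.0}),
    AdjacencyVertex(2, 3.0, {3: 0.0}),
    AdjacencyVertex(3, 1.0, {}),
]
result = superstep(graph)
```

Because the edges travel with their source vertex, the iteration only has to update one dataset of records, which is why this representation might leave the existing iteration APIs untouched.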

Best,
-Vasia.


>
> Best,
> Xingcan

Re: [DISCUSS] Gelly planning for release 1.3 and roadmap

Posted by Xingcan Cui <xi...@gmail.com>.
Hi Vasia and Greg,

I totally agree with you. The basic design idea behind Flink and Gelly's
API meets my personal taste well.

Marking stable APIs must not be as easy as it looks and I don't think I
am eligible to talk about it now : )

IMO, updating multiple datasets is essential for making Gelly "commonly
applicable". (The MST algorithm needs to mark edges during the iteration and
I think there are surely other algorithms more complicated than that.)

As for the intermediate caching problem, I think users themselves should
decide when to cache the results and when to release them
(maybe Flink will also do the auto-release detection when a dataset will
not be accessed any more).

Graph computing on streams is really attractive and maybe we should find
some use cases first. I am not sure if this paper [1] (and the
corresponding project [2]) will help.

Best,
Xingcan

[1] http://www.vldb.org/pvldb/vol9/p1281-sharma.pdf
[2] https://github.com/twitter/GraphJet

On Thu, Mar 2, 2017 at 1:16 AM, Greg Hogan <co...@greghogan.com> wrote:

> Flink’s stable API provides the frameworks (DataStream and DataSet). On
> top of these frameworks Gelly provides additional models for iterative
> algorithms, but there are algorithms such as Minimum Spanning Tree which do
> not easily map to these models (in this instance requiring nested
> iterations; for PageRank it was handling directed graphs; for HITS it was
> processing both in- and out-edges in the same iteration).
>
> One challenge with caching results is when to release the resources.
>
> New algorithms typically require new capabilities, the latter typically
> requiring much more work, so the algorithms are virtually free.
>
> Updating multiple DataSets in an iteration should be another consideration
> for improving the scheduler. Where has this been a limitation?

Re: [DISCUSS] Gelly planning for release 1.3 and roadmap

Posted by Greg Hogan <co...@greghogan.com>.
Flink’s stable API provides the frameworks (DataStream and DataSet). On top of these frameworks Gelly provides additional models for iterative algorithms, but there are algorithms such as Minimum Spanning Tree which do not easily map to these models (in this instance requiring nested iterations; for PageRank it was handling directed graphs; for HITS it was processing both in- and out-edges in the same iteration).

One challenge with caching results is when to release the resources.

New algorithms typically require new capabilities, the latter typically requiring much more work, so the algorithms are virtually free.

Updating multiple DataSets in an iteration should be another consideration for improving the scheduler. Where has this been a limitation?


> On Feb 27, 2017, at 8:03 AM, Xingcan Cui <xi...@gmail.com> wrote:
> 
> Hi Vasia and Greg,
> 
> thanks for the discussion. I'd like to share my thoughts.
> 
> 1) I don't think it's necessary to extend the algorithm list intentionally.
> It's just like a textbook that can not cover all the existing algorithms
> (even if we can). Just representative and commonly used ones will be
> enough. After all, Gelly is mainly designed for providing a framework
> rather than an algorithm library. Besides, it seems that Gelly's API is not
> stable now and thus a huge work of refactoring or even rewriting will rise
> once the API changes.
> 
> 2) Unlike other "pure" graph computing framework (e.g. giraph), Gelly is
> built on top of Flink, which means that it can only use operations that
> provided by it. In my own opinion, Flink's batch processing is not so
> outstanding as it's stream. As Grey said, one problem lies on intermediate
> results caching. Though it's not clear for me (I'm still a ignorant new
> comer...) why this feature has not been implemented for such a long time,
> there must be some reasons. What I see is that, to some extent, it's
> already obstructed Gelly's development. From this point of view,
> self-blessing is better than blessing from others and I'm sure some MLers
> may be more anxious than us :) So, I guess "within Gelly" just means a
> Gelly-driven development?
> 
> In a nutshell, I will encourage more concentrations on Gelly's API (or even
> related Flink's API if necessary), rather than high-level things (e.g.
> algorithms, performance) on top of it. What if we can change both the
> edges' values and vertices' values during an iteration one day? :)
> 
> Best,
> Xingcan
> 
> 


Re: [DISCUSS] Gelly planning for release 1.3 and roadmap

Posted by Xingcan Cui <xi...@gmail.com>.
Hi Vasia and Greg,

thanks for the discussion. I'd like to share my thoughts.

1) I don't think it's necessary to extend the algorithm list intentionally.
It's just like a textbook that cannot cover all the existing algorithms
(even if we could). Just the representative and commonly used ones will be
enough. After all, Gelly is mainly designed to provide a framework
rather than an algorithm library. Besides, it seems that Gelly's API is not
stable yet, so a huge amount of refactoring or even rewriting will arise
once the API changes.

2) Unlike other "pure" graph computing frameworks (e.g. Giraph), Gelly is
built on top of Flink, which means that it can only use the operations that
Flink provides. In my own opinion, Flink's batch processing is not as
outstanding as its stream processing. As Greg said, one problem lies in
caching intermediate results. Though it's not clear to me (I'm still an
ignorant newcomer...) why this feature has not been implemented for such a
long time, there must be some reasons. What I see is that, to some extent,
it has already obstructed Gelly's development. From this point of view,
self-blessing is better than blessing from others and I'm sure some MLers
may be more anxious than us :) So, I guess "within Gelly" just means a
Gelly-driven development?

In a nutshell, I would encourage more concentration on Gelly's API (or even
the related Flink API if necessary), rather than on high-level things (e.g.
algorithms, performance) on top of it. What if we can change both the
edges' values and vertices' values during an iteration one day? :)

Best,
Xingcan


On Sat, Feb 25, 2017 at 2:43 AM, Vasiliki Kalavri <vasilikikalavri@gmail.com
> wrote:

> Hi Greg,
>
> On 24 February 2017 at 18:09, Greg Hogan <co...@greghogan.com> wrote:
>
> > Thanks, Vasia, for starting the discussion.
> >
> > I was expecting more changes from the recent discussion on restructuring
> > the project, in particular regarding the libraries. Gelly has always
> > collected algorithms and I have personally taken an algorithms-first
> > approach for contributions. Is that manageable and maintainable? I'd
> prefer
> > to see no limit to good contributions, and if necessary split the
> codebase
> > or the project.
> >
>
> I don't think there should be a limit either. I do think though that
> development should be community-driven, i.e. not making contributions just
> for the sake of it, but evaluating their benefit first.
> The library already has quite a long list of algorithms. Shall we keep on
> extending it? And if yes, how do we choose which algorithms to add? Do we
> accept any algorithm even if it hasn't been asked for by anyone? So far, we've
> added algorithms that we thought were useful and common. But continuing to
> extend the library like this doesn't seem maintainable to me, because we
> might end up with a lot of code to maintain that nobody uses. On the other
> hand, adding more algorithms might attract more users, so I see a trade-off
> there.
>
>
> >
> > If so, then a secondary goal is to make the algorithms user-accessible
> and
> > easier to review (especially at scale!). FLINK-4949 rewrites
> > flink-gelly-examples with modular inputs and algorithms, allows users to
> > run all existing algorithms, and makes it trivial to create a driver for
> > new algorithms (and when comparing different implementations).
> >
>
> I'm +1 for anything that makes using existing functionality easier.
> FLINK-4949 sounds like a great addition. Could you maybe extend the JIRA
> and/or PR description a bit? I understand the rationale but it would be
> nice to have a high-level description of the changes and the new
> functionality that the PR adds or the interfaces it modifies. Otherwise, it
> will be difficult to review a PR with +5k line changes :)
>
>
>
> >
> > Regarding BipartiteGraphs, without algorithms or ideas for algorithms
> it's
> > not possible to review the structure of the open pull requests.
> >
>
>
> I'm not sure I understand this point. There was a design document and an
> extensive discussion on this issue. Do you think we should revisit? Some
> common algorithms for bipartite graphs that I am aware of are SALSA for
> recommendations and relevance search for anomaly detection.
>
>
>
> >
> > +1 to evaluating performance and promoting Flink!
> >
> > Gelly has two shepherds whereas CEP and ML share one committer. New
> > algorithms in Gelly require new features in the Batch API (Gelly may also
> > start doing streaming, we're cool kids, too)
>
>
> ^^
>
>
> > so we need to find a process
> > for snuffing ideas early and for the right balance in dependence on core
> > committers' time. For example, reworking the iteration scheduler to allow
> > for intermediate outputs and nested iterations. Can this feature be
> > developed and reviewed within Gelly?
> > Does it need the blessing of a Stephan
> > or Fabian? I'd like to see contributors and committers less dependent on
> > the core team and more autonomous.
> >
>
> What do you mean by developed and reviewed "within Gelly"?
> This feature would require changes in the batch iterations code and will
> probably need to be proposed and reviewed as a FLIP, so it would need the
> blessing of the community :)
>
> It would of course help to have someone who is more familiar with this part
> of the code, but I don't think it's absolutely necessary.
>
> -V.
>
>

Re: [DISCUSS] Gelly planning for release 1.3 and roadmap

Posted by Vasiliki Kalavri <va...@gmail.com>.
Hi Greg,

On 24 February 2017 at 18:09, Greg Hogan <co...@greghogan.com> wrote:

> Thanks, Vasia, for starting the discussion.
>
> I was expecting more changes from the recent discussion on restructuring
> the project, in particular regarding the libraries. Gelly has always
> collected algorithms and I have personally taken an algorithms-first
> approach for contributions. Is that manageable and maintainable? I'd prefer
> to see no limit to good contributions, and if necessary split the codebase
> or the project.
>

I don't think there should be a limit either. I do think though that
development should be community-driven, i.e. not making contributions just
for the sake of it, but evaluating their benefit first.
The library already has quite a long list of algorithms. Shall we keep on
extending it? And if yes, how do we choose which algorithms to add? Do we
accept any algorithm even if it hasn't been asked for by anyone? So far, we've
added algorithms that we thought were useful and common. But continuing to
extend the library like this doesn't seem maintainable to me, because we
might end up with a lot of code to maintain that nobody uses. On the other
hand, adding more algorithms might attract more users, so I see a trade-off
there.


>
> If so, then a secondary goal is to make the algorithms user-accessible and
> easier to review (especially at scale!). FLINK-4949 rewrites
> flink-gelly-examples with modular inputs and algorithms, allows users to
> run all existing algorithms, and makes it trivial to create a driver for
> new algorithms (and when comparing different implementations).
>

I'm +1 for anything that makes using existing functionality easier.
FLINK-4949 sounds like a great addition. Could you maybe extend the JIRA
and/or PR description a bit? I understand the rationale but it would be
nice to have a high-level description of the changes and the new
functionality that the PR adds or the interfaces it modifies. Otherwise, it
will be difficult to review a PR with +5k line changes :)



>
> Regarding BipartiteGraphs, without algorithms or ideas for algorithms it's
> not possible to review the structure of the open pull requests.
>


I'm not sure I understand this point. There was a design document and an
extensive discussion on this issue. Do you think we should revisit? Some
common algorithms for bipartite graphs that I am aware of are SALSA for
recommendations and relevance search for anomaly detection.



>
> +1 to evaluating performance and promoting Flink!
>
> Gelly has two shepherds whereas CEP and ML share one committer. New
> algorithms in Gelly require new features in the Batch API (Gelly may also
> start doing streaming, we're cool kids, too)


^^


> so we need to find a process
> for snuffing ideas early and for the right balance in dependence on core
> committers' time. For example, reworking the iteration scheduler to allow
> for intermediate outputs and nested iterations. Can this feature be
> developed and reviewed within Gelly?
> Does it need the blessing of a Stephan
> or Fabian? I'd like to see contributors and committers less dependent on
> the core team and more autonomous.
>

What do you mean by developed and reviewed "within Gelly"?
This feature would require changes in the batch iterations code and will
probably need to be proposed and reviewed as a FLIP, so it would need the
blessing of the community :)

It would of course help to have someone who is more familiar with this part
of the code, but I don't think it's absolutely necessary.

-V.



Re: [DISCUSS] Gelly planning for release 1.3 and roadmap

Posted by Greg Hogan <co...@greghogan.com>.
Thanks, Vasia, for starting the discussion.

I was expecting more changes from the recent discussion on restructuring
the project, in particular regarding the libraries. Gelly has always
collected algorithms and I have personally taken an algorithms-first
approach for contributions. Is that manageable and maintainable? I'd prefer
to see no limit to good contributions, and if necessary split the codebase
or the project.

If so, then a secondary goal is to make the algorithms user-accessible and
easier to review (especially at scale!). FLINK-4949 rewrites
flink-gelly-examples with modular inputs and algorithms, allows users to
run all existing algorithms, and makes it trivial to create a driver for
new algorithms (and when comparing different implementations).

Regarding BipartiteGraphs, without algorithms or ideas for algorithms it's
not possible to review the structure of the open pull requests.

+1 to evaluating performance and promoting Flink!

Gelly has two shepherds whereas CEP and ML share one committer. New
algorithms in Gelly require new features in the Batch API (Gelly may also
start doing streaming, we're cool kids, too) so we need to find a process
for snuffing ideas early and for the right balance in dependence on core
committers' time. For example, reworking the iteration scheduler to allow
for intermediate outputs and nested iterations. Can this feature be
developed and reviewed within Gelly? Does it need the blessing of a Stephan
or Fabian? I'd like to see contributors and committers less dependent on
the core team and more autonomous.

Greg
