You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by Vasiliki Kalavri <va...@gmail.com> on 2015/02/19 21:03:30 UTC

[DISCUSS] Gelly iteration abstractions

Hello beautiful Flink people,

during the past few days, Andra and I have been discussing about how to
extend Gelly's iteration methods.

Alexander's course (and his awesome students) has made it obvious that
vertex-centric iterations are not the best fit for algorithms which don't
follow the common "propagate-update" pattern. For example, Andra is working
on an implementation of Minimum Spanning Tree, which requires branching
inside an iteration and also requires a convergence check of an internal
iteration. Others also reported similar issues [1, 2]. Trying to fit such
algorithms to the vertex-centric model leads to long and ugly code, e.g.
aggregators to keep track of algorithm phases, duplicating data, etc.

One limitation of the vertex-centric and the upcoming GAS model is that
they both only allow the vertex values to be updated in each iteration.
However, for some algorithms we need to update the edge values and in
others we need to update both. In even more complex situations (like
Andra's MST) in some iterations we need to update the vertex values and in
some iterations we need to update the edge values.
Another problem is that we currently don't have a way to allow different
computational phases inside an iteration. This is something that Giraph
solves with master compute, a function that is executed once before each
superstep and sets the computation function.

All that said, I believe that we can solve most of these issues if we
nicely expose Flink's iteration operators in Gelly. I can see the following
cases:

1. Bulk & delta iterations where the solution set is the vertex dataset:
this will be similar to vertex-centric and GAS, but will allow more
flexible dataflows inside the iteration.
2. Bulk & delta iterations where the solution set is the edge dataset: for
the cases where we need to update edge values.
3. Bulk & delta iterations where the solution set is the Graph: this will
cover more complex cases, where the algorithm updates both vertices and
edges or even adds/removes vertices/edges, i.e. updates the whole Graph.

What do you think? I can see 1 & 2 being very easy to implement, but I
suspect 3 won't be that easy (but so awesome to have ^^).
Would it work the way a Graph is represented now, i.e. with 2 DataSets?

Any comment, idea, pointer would be much appreciated! Thank you ^^

Cheers,
-V.

[1]:
http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
[2]:
http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769

Re: [DISCUSS] Gelly iteration abstractions

Posted by Vasiliki Kalavri <va...@gmail.com>.

Hey,

Wow, Vasia, a GC limit error in the Optimizer (pre-flight), not the runtime.
>
> That is cannot have been easy to produce ;-) The program does not really
> look that tricky, I need to take some closer look at this...
>

It was pretty easy actually :P
The program is really simple: in each iteration, compute the vertex
degrees, filter out the ones that have degree less than K and emit the
resulting graph.


>
> BTW: Before this algorithm runs well, we really need pinning intermediate
> results in memory. Otherwise each step of the loop will execute its entire
> history every time again.


Can you elaborate a bit on this? Is there some ongoing work on pinning
intermediate results in memory?
And can I maybe help somehow? :-)


> A way to mitigate this for now is to manually
> break persist and re-read the intermediate graph.


I wouldn't want to mitigate this issue in a hacky way, but rather have a
nice and clean solution to allow writing such iterative programs, where a
graph is transformed inside an iteration using Gelly methods and then fed
as input to the next iteration.


> The ML library does that
> in some points to break programs into shorter, more feasible portions.
>

Could you point me to an example where this is happening?


>
> Also: Do you cache the vertex count in the graph? May be worth doing,
> otherwise all counts are computed twice (for current graph and for the
> previous graph)
>
>
That's a valid point, I can cache the previous count. But even without
computing the degrees and just running the loop for a fixed number of
iterations results in the same error :/



> Stephan
>
>
Thanks a lot!

-V.



>
> On Mon, Mar 30, 2015 at 10:33 PM, Vasiliki Kalavri <
> vasilikikalavri@gmail.com> wrote:
>
> > Hi,
> >
> > I'm back looking into this and I tried out the for-loop approach that was
> > suggested above.
> >
> > I implemented a simple algorithm, k-core, which computes the k-core of a
> > graph by iteratively filtering out vertices with degree less than k.
> > You can find the code in [1].
> >
> > Unfortunately, this is giving me the following error, with an example
> graph
> > of 10 vertices:
> >
> > Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
> > exceeded
> > at
> >
> >
> org.apache.flink.optimizer.plan.PlanNode.mergeBranchPlanMaps(PlanNode.java:106)
> > at
> >
> >
> org.apache.flink.optimizer.plan.PlanNode.mergeBranchPlanMaps(PlanNode.java:99)
> > at
> >
> >
> org.apache.flink.optimizer.plan.DualInputPlanNode.<init>(DualInputPlanNode.java:90)
> > at
> >
> >
> org.apache.flink.optimizer.plan.DualInputPlanNode.<init>(DualInputPlanNode.java:69)
> > at
> >
> >
> org.apache.flink.optimizer.operators.HashJoinBuildSecondProperties.instantiate(HashJoinBuildSecondProperties.java:81)
> > at
> >
> >
> org.apache.flink.optimizer.dag.TwoInputNode.instantiate(TwoInputNode.java:607)
> > at
> >
> >
> org.apache.flink.optimizer.dag.TwoInputNode.addLocalCandidates(TwoInputNode.java:557)
> > at
> >
> >
> org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:478)
> > at
> >
> >
> org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:309)
> > at
> >
> >
> org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:309)
> > at
> >
> >
> org.apache.flink.optimizer.dag.SingleInputNode.getAlternativePlans(SingleInputNode.java:249)
> > at
> >
> >
> org.apache.flink.optimizer.dag.SingleInputNode.getAlternativePlans(SingleInputNode.java:249)
> > at
> >
> >
> org.apache.flink.optimizer.dag.DataSinkNode.getAlternativePlans(DataSinkNode.java:204)
> > at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:501)
> > at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:403)
> > at
> > org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:162)
> > at
> >
> >
> org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:51)
> > at
> >
> >
> org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:797)
> > at org.apache.flink.api.java.DataSet.count(DataSet.java:397)
> > at org.apache.flink.graph.Graph.numberOfVertices(Graph.java:913)
> > at org.apache.flink.graph.example.KCoreExample.main(KCoreExample.java:91)
> >
> > Basically, the first 3 iterations are executed normally, then the 4th
> gets
> > stuck and eventually the program fails with the above error message.
> > Any help / suggestion would be greatly appreciated ^^
> >
> > Cheers,
> > -Vasia.
> >
> > [1]:
> >
> >
> https://github.com/vasia/flink/commit/62b065b69746e25b74bee84d210d5c5b4ee761bd
> >
> > On 23 February 2015 at 18:05, Vasiliki Kalavri <
> vasilikikalavri@gmail.com>
> > wrote:
> >
> > > I see, thanks a lot for the answers!
> > > To rephrase my original question, would it make sense to define a
> > > closed-loop iteration where the state is the whole graph?
> > >
> > > If you want to take a look at the current implementation of DMST using
> > > delta iteration, Andra has made a PR [1].
> > > On a high-level, this algorithm does (more or less) the following:
> > >
> > > while there are more than one vertices in the graph
> > >    for every vertex
> > >       select the min-weight edge
> > >       add all selected edges to the MST
> > >    collapse vertices with the same root into one vertex (iterative
> > > connected components-like step)
> > >
> > > What we are thinking is that we could express this with -maybe- a
> > > bulk-like iteration on the Graph, inside which we can use Gelly
> methods.
> > > Would it make sense or shall we go for a for-loop implementation
> instead?
> > >
> > > Thanks!
> > > -V.
> > >
> > > [1]: http://github.com/apache/flink/pull/434
> > >
> > > On 23 February 2015 at 16:18, Stephan Ewen <se...@apache.org> wrote:
> > >
> > >> Closed-loop iterations are much more efficient right now. Long for
> loops
> > >> suffer from memory fragmentation (an issue that is in the list to
> fix).
> > >>
> > >> Also, closed loops can be stateful (delta iterations) and do not
> require
> > >> task re-deployment.
> > >>
> > >> On Mon, Feb 23, 2015 at 4:15 PM, Vasiliki Kalavri <
> > >> vasilikikalavri@gmail.com
> > >> > wrote:
> > >>
> > >> > I see that's cool :-)
> > >> > So, what is the advantage of closed-loop versus for-loop iterations?
> > >> > Custom convergence criteria / aggregators and more efficient
> execution
> > >> > plans?
> > >> >
> > >> > On 23 February 2015 at 15:01, Stephan Ewen <se...@apache.org>
> wrote:
> > >> >
> > >> > > For loops are basically rolled out - they yield long execution
> > plans.
> > >> > >
> > >> > > On Mon, Feb 23, 2015 at 2:44 PM, Vasiliki Kalavri <
> > >> > > vasilikikalavri@gmail.com
> > >> > > > wrote:
> > >> > >
> > >> > > > for-loop iterations could cover some cases, I guess, when the
> > >> number of
> > >> > > > iterations is known beforehand.
> > >> > > > Are there currently any restrictions on what can be used inside
> a
> > >> > > for-loop?
> > >> > > > How are they translated into execution plans?
> > >> > > >
> > >> > > > On 23 February 2015 at 13:08, Stephan Ewen <se...@apache.org>
> > >> wrote:
> > >> > > >
> > >> > > > > Some things may not work well as "closed-loop" iterations.
> > >> > > > >
> > >> > > > > Is it possible to express those as for-loop iterations?
> > >> > > > >
> > >> > > > > On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <
> > >> > > > > vasilikikalavri@gmail.com
> > >> > > > > > wrote:
> > >> > > > >
> > >> > > > > > Hi Stephan,
> > >> > > > > >
> > >> > > > > > yes, this would work for the cases where an algorithm only
> > >> updates
> > >> > > the
> > >> > > > > > vertex values or only updates the edge values.
> > >> > > > > >
> > >> > > > > > What we would like to also support is
> > >> > > > > > (a) algorithms where both vertices and edges are updated in
> > one
> > >> > > > iteration
> > >> > > > > > (b) algorithms where the graph structure changes from one
> > >> iteration
> > >> > > to
> > >> > > > > the
> > >> > > > > > next and
> > >> > > > > > (c) branching inside an iteration, i.e. executing a
> different
> > >> > > > "iteration
> > >> > > > > > body" based on some condition.
> > >> > > > > >
> > >> > > > > > We can still currently implement those with regular Flink
> > >> iteration
> > >> > > > > > operators, but the resulting code is not that nice or
> > efficient.
> > >> > > > > > For example, if we want to update both edges and vertex
> > values,
> > >> we
> > >> > > can
> > >> > > > > > still create a solution set where the vertex values are
> > >> attached to
> > >> > > > each
> > >> > > > > > edge.
> > >> > > > > > Regarding branching inside an iteration, we can use an
> > >> aggregator
> > >> > > that
> > >> > > > > > tracks the iteration phase, but then we need to somehow make
> > the
> > >> > > > > different
> > >> > > > > > phases to consist of the same operators and also check the
> > >> > branching
> > >> > > > > > condition inside each UDF.
> > >> > > > > >
> > >> > > > > > Cheers,
> > >> > > > > > V.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On 23 February 2015 at 11:05, Stephan Ewen <
> sewen@apache.org>
> > >> > wrote:
> > >> > > > > >
> > >> > > > > > > As a workaround, it should always work to get the Edge and
> > >> Vertex
> > >> > > > data
> > >> > > > > > set
> > >> > > > > > > from the graph and use the regular Fink iteration
> operators?
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
> > >> > > > > > > vasilikikalavri@gmail.com
> > >> > > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi,
> > >> > > > > > > >
> > >> > > > > > > > yes, I was referring to the parallel Boruvka algorithm.
> > >> There
> > >> > are
> > >> > > > > > several
> > >> > > > > > > > ways to implement this one in Flink and I believe that
> the
> > >> one
> > >> > > > > > described
> > >> > > > > > > in
> > >> > > > > > > > the paper (vertex-centric) is not the most elegant one
> :)
> > >> > > > > > > >
> > >> > > > > > > > Andra is now working on an idea that uses the delta
> > >> iteration
> > >> > > > > > abstraction
> > >> > > > > > > > and we believe that it will be both more efficient and
> > >> easier
> > >> > to
> > >> > > > > > > > understand. It has the edges in the solution set and the
> > >> > vertices
> > >> > > > in
> > >> > > > > > the
> > >> > > > > > > > workset, so it follows the pattern I describe in (2) in
> my
> > >> > > previous
> > >> > > > > > > e-mail.
> > >> > > > > > > > As a next step, we would like to see how having an
> > iteration
> > >> > > > operator
> > >> > > > > > > that
> > >> > > > > > > > could update the whole graph -what I describe as (3)-
> > would
> > >> > make
> > >> > > > this
> > >> > > > > > > even
> > >> > > > > > > > nicer.
> > >> > > > > > > >
> > >> > > > > > > > Any ideas are highly welcome!
> > >> > > > > > > >
> > >> > > > > > > > Cheers,
> > >> > > > > > > > V.
> > >> > > > > > > >
> > >> > > > > > > > On 22 February 2015 at 16:32, Andra Lungu <
> > >> > lungu.andra@gmail.com
> > >> > > >
> > >> > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Hi Alex,
> > >> > > > > > > > >
> > >> > > > > > > > > Vasia is talking about the second version(presented
> > >> Friday)
> > >> > of
> > >> > > > > > Parallel
> > >> > > > > > > > > Boruvka, which can be found here:
> > >> > > > > > > > >
> https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> > >> > > > > > > > >
> > >> > > > > > > > > I will propose the third, non-Pregel like approach
> > >> directly
> > >> > to
> > >> > > > > Gelly
> > >> > > > > > > > soon.
> > >> > > > > > > > >
> > >> > > > > > > > > If you have additional questions, I will be happy to
> > >> answer
> > >> > > them.
> > >> > > > > > > > >
> > >> > > > > > > > > Andra
> > >> > > > > > > > >
> > >> > > > > > > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov
> <
> > >> > > > > > > > > alexander.s.alexandrov@gmail.com> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > Hi Vasia,
> > >> > > > > > > > > >
> > >> > > > > > > > > > I am trying to look at the problem in more detail.
> > Which
> > >> > > > version
> > >> > > > > of
> > >> > > > > > > the
> > >> > > > > > > > > MST
> > >> > > > > > > > > > are you talking about?
> > >> > > > > > > > > >
> > >> > > > > > > > > > Right now in the Gelly repository I can only find
> the
> > >> SSSP
> > >> > > > > example
> > >> > > > > > > > > > (parallel Bellman-Ford) from Section 4.2 in [1].
> > >> > > > > > > > > >
> > >> > > > > > > > > > However, it seems that the issues encountered by
> Andra
> > >> are
> > >> > > > > related
> > >> > > > > > to
> > >> > > > > > > > the
> > >> > > > > > > > > > implementation of Parallel Boruvka (Section 3.2 in
> > >> [2]). Is
> > >> > > > that
> > >> > > > > > > > correct?
> > >> > > > > > > > > >
> > >> > > > > > > > > > Regards,
> > >> > > > > > > > > > A.
> > >> > > > > > > > > >
> > >> > > > > > > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> > >> > > > > > > > > > [2]
> http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> > >> > > > > > > > > >
> > >> > > > > > > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
> > >> > > > > > > vasilikikalavri@gmail.com
> > >> > > > > > > > >:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > Hello beautiful Flink people,
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > during the past few days, Andra and I have been
> > >> > discussing
> > >> > > > > about
> > >> > > > > > > how
> > >> > > > > > > > to
> > >> > > > > > > > > > > extend Gelly's iteration methods.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Alexander's course (and his awesome students) has
> > >> made it
> > >> > > > > obvious
> > >> > > > > > > > that
> > >> > > > > > > > > > > vertex-centric iterations are not the best fit for
> > >> > > algorithms
> > >> > > > > > which
> > >> > > > > > > > > don't
> > >> > > > > > > > > > > follow the common "propagate-update" pattern. For
> > >> > example,
> > >> > > > > Andra
> > >> > > > > > is
> > >> > > > > > > > > > working
> > >> > > > > > > > > > > on an implementation of Minimum Spanning Tree,
> which
> > >> > > requires
> > >> > > > > > > > branching
> > >> > > > > > > > > > > inside an iteration and also requires a
> convergence
> > >> check
> > >> > > of
> > >> > > > an
> > >> > > > > > > > > internal
> > >> > > > > > > > > > > iteration. Others also reported similar issues [1,
> > 2].
> > >> > > Trying
> > >> > > > > to
> > >> > > > > > > fit
> > >> > > > > > > > > such
> > >> > > > > > > > > > > algorithms to the vertex-centric model leads to
> long
> > >> and
> > >> > > ugly
> > >> > > > > > code,
> > >> > > > > > > > > e.g.
> > >> > > > > > > > > > > aggregators to keep track of algorithm phases,
> > >> > duplicating
> > >> > > > > data,
> > >> > > > > > > etc.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > One limitation of the vertex-centric and the
> > upcoming
> > >> GAS
> > >> > > > model
> > >> > > > > > is
> > >> > > > > > > > that
> > >> > > > > > > > > > > they both only allow the vertex values to be
> updated
> > >> in
> > >> > > each
> > >> > > > > > > > iteration.
> > >> > > > > > > > > > > However, for some algorithms we need to update the
> > >> edge
> > >> > > > values
> > >> > > > > > and
> > >> > > > > > > in
> > >> > > > > > > > > > > others we need to update both. In even more
> complex
> > >> > > > situations
> > >> > > > > > > (like
> > >> > > > > > > > > > > Andra's MST) in some iterations we need to update
> > the
> > >> > > vertex
> > >> > > > > > values
> > >> > > > > > > > and
> > >> > > > > > > > > > in
> > >> > > > > > > > > > > some iterations we need to update the edge values.
> > >> > > > > > > > > > > Another problem is that we currently don't have a
> > way
> > >> to
> > >> > > > allow
> > >> > > > > > > > > different
> > >> > > > > > > > > > > computational phases inside an iteration. This is
> > >> > something
> > >> > > > > that
> > >> > > > > > > > Giraph
> > >> > > > > > > > > > > solves with master compute, a function that is
> > >> executed
> > >> > > once
> > >> > > > > > before
> > >> > > > > > > > > each
> > >> > > > > > > > > > > superstep and sets the computation function.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > All that said, I believe that we can solve most of
> > >> these
> > >> > > > issues
> > >> > > > > > if
> > >> > > > > > > we
> > >> > > > > > > > > > > nicely expose Flink's iteration operators in
> Gelly.
> > I
> > >> can
> > >> > > see
> > >> > > > > the
> > >> > > > > > > > > > following
> > >> > > > > > > > > > > cases:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > 1. Bulk & delta iterations where the solution set
> is
> > >> the
> > >> > > > vertex
> > >> > > > > > > > > dataset:
> > >> > > > > > > > > > > this will be similar to vertex-centric and GAS,
> but
> > >> will
> > >> > > > allow
> > >> > > > > > more
> > >> > > > > > > > > > > flexible dataflows inside the iteration.
> > >> > > > > > > > > > > 2. Bulk & delta iterations where the solution set
> is
> > >> the
> > >> > > edge
> > >> > > > > > > > dataset:
> > >> > > > > > > > > > for
> > >> > > > > > > > > > > the cases where we need to update edge values.
> > >> > > > > > > > > > > 3. Bulk & delta iterations where the solution set
> is
> > >> the
> > >> > > > Graph:
> > >> > > > > > > this
> > >> > > > > > > > > will
> > >> > > > > > > > > > > cover more complex cases, where the algorithm
> > updates
> > >> > both
> > >> > > > > > vertices
> > >> > > > > > > > and
> > >> > > > > > > > > > > edges or even adds/removes vertices/edges, i.e.
> > >> updates
> > >> > the
> > >> > > > > whole
> > >> > > > > > > > > Graph.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > What do you think? I can see 1 & 2 being very easy
> > to
> > >> > > > > implement,
> > >> > > > > > > but
> > >> > > > > > > > I
> > >> > > > > > > > > > > suspect 3 won't be that easy (but so awesome to
> have
> > >> ^^).
> > >> > > > > > > > > > > Would it work the way a Graph is represented now,
> > i.e.
> > >> > > with 2
> > >> > > > > > > > DataSets?
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Any comment, idea, pointer would be much
> > appreciated!
> > >> > Thank
> > >> > > > you
> > >> > > > > > ^^
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Cheers,
> > >> > > > > > > > > > > -V.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > [1]:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > >> > > > > > > > > > > [2]:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Stephan Ewen <se...@apache.org>.

Wow, Vasia, a GC limit error in the Optimizer (pre-flight), not the runtime.

That is cannot have been easy to produce ;-) The program does not really
look that tricky, I need to take some closer look at this...

BTW: Before this algorithm runs well, we really need pinning intermediate
results in memory. Otherwise each step of the loop will execute its entire
history every time again. A way to mitigate this for now is to manually
break persist and re-read the intermediate graph. The ML library does that
in some points to break programs into shorter, more feasible portions.

Also: Do you cache the vertex count in the graph? May be worth doing,
otherwise all counts are computed twice (for current graph and for the
previous graph)

Stephan


On Mon, Mar 30, 2015 at 10:33 PM, Vasiliki Kalavri <
vasilikikalavri@gmail.com> wrote:

> Hi,
>
> I'm back looking into this and I tried out the for-loop approach that was
> suggested above.
>
> I implemented a simple algorithm, k-core, which computes the k-core of a
> graph by iteratively filtering out vertices with degree less than k.
> You can find the code in [1].
>
> Unfortunately, this is giving me the following error, with an example graph
> of 10 vertices:
>
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
> exceeded
> at
>
> org.apache.flink.optimizer.plan.PlanNode.mergeBranchPlanMaps(PlanNode.java:106)
> at
>
> org.apache.flink.optimizer.plan.PlanNode.mergeBranchPlanMaps(PlanNode.java:99)
> at
>
> org.apache.flink.optimizer.plan.DualInputPlanNode.<init>(DualInputPlanNode.java:90)
> at
>
> org.apache.flink.optimizer.plan.DualInputPlanNode.<init>(DualInputPlanNode.java:69)
> at
>
> org.apache.flink.optimizer.operators.HashJoinBuildSecondProperties.instantiate(HashJoinBuildSecondProperties.java:81)
> at
>
> org.apache.flink.optimizer.dag.TwoInputNode.instantiate(TwoInputNode.java:607)
> at
>
> org.apache.flink.optimizer.dag.TwoInputNode.addLocalCandidates(TwoInputNode.java:557)
> at
>
> org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:478)
> at
>
> org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:309)
> at
>
> org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:309)
> at
>
> org.apache.flink.optimizer.dag.SingleInputNode.getAlternativePlans(SingleInputNode.java:249)
> at
>
> org.apache.flink.optimizer.dag.SingleInputNode.getAlternativePlans(SingleInputNode.java:249)
> at
>
> org.apache.flink.optimizer.dag.DataSinkNode.getAlternativePlans(DataSinkNode.java:204)
> at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:501)
> at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:403)
> at
> org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:162)
> at
>
> org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:51)
> at
>
> org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:797)
> at org.apache.flink.api.java.DataSet.count(DataSet.java:397)
> at org.apache.flink.graph.Graph.numberOfVertices(Graph.java:913)
> at org.apache.flink.graph.example.KCoreExample.main(KCoreExample.java:91)
>
> Basically, the first 3 iterations are executed normally, then the 4th gets
> stuck and eventually the program fails with the above error message.
> Any help / suggestion would be greatly appreciated ^^
>
> Cheers,
> -Vasia.
>
> [1]:
>
> https://github.com/vasia/flink/commit/62b065b69746e25b74bee84d210d5c5b4ee761bd
>
> On 23 February 2015 at 18:05, Vasiliki Kalavri <va...@gmail.com>
> wrote:
>
> > I see, thanks a lot for the answers!
> > To rephrase my original question, would it make sense to define a
> > closed-loop iteration where the state is the whole graph?
> >
> > If you want to take a look at the current implementation of DMST using
> > delta iteration, Andra has made a PR [1].
> > On a high-level, this algorithm does (more or less) the following:
> >
> > while there are more than one vertices in the graph
> >    for every vertex
> >       select the min-weight edge
> >       add all selected edges to the MST
> >    collapse vertices with the same root into one vertex (iterative
> > connected components-like step)
> >
> > What we are thinking is that we could express this with -maybe- a
> > bulk-like iteration on the Graph, inside which we can use Gelly methods.
> > Would it make sense or shall we go for a for-loop implementation instead?
> >
> > Thanks!
> > -V.
> >
> > [1]: http://github.com/apache/flink/pull/434
> >
> > On 23 February 2015 at 16:18, Stephan Ewen <se...@apache.org> wrote:
> >
> >> Closed-loop iterations are much more efficient right now. Long for loops
> >> suffer from memory fragmentation (an issue that is in the list to fix).
> >>
> >> Also, closed loops can be stateful (delta iterations) and do not require
> >> task re-deployment.
> >>
> >> On Mon, Feb 23, 2015 at 4:15 PM, Vasiliki Kalavri <
> >> vasilikikalavri@gmail.com
> >> > wrote:
> >>
> >> > I see that's cool :-)
> >> > So, what is the advantage of closed-loop versus for-loop iterations?
> >> > Custom convergence criteria / aggregators and more efficient execution
> >> > plans?
> >> >
> >> > On 23 February 2015 at 15:01, Stephan Ewen <se...@apache.org> wrote:
> >> >
> >> > > For loops are basically rolled out - they yield long execution
> plans.
> >> > >
> >> > > On Mon, Feb 23, 2015 at 2:44 PM, Vasiliki Kalavri <
> >> > > vasilikikalavri@gmail.com
> >> > > > wrote:
> >> > >
> >> > > > for-loop iterations could cover some cases, I guess, when the
> >> number of
> >> > > > iterations is known beforehand.
> >> > > > Are there currently any restrictions on what can be used inside a
> >> > > for-loop?
> >> > > > How are they translated into execution plans?
> >> > > >
> >> > > > On 23 February 2015 at 13:08, Stephan Ewen <se...@apache.org>
> >> wrote:
> >> > > >
> >> > > > > Some things may not work well as "closed-loop" iterations.
> >> > > > >
> >> > > > > Is it possible to express those as for-loop iterations?
> >> > > > >
> >> > > > > On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <
> >> > > > > vasilikikalavri@gmail.com
> >> > > > > > wrote:
> >> > > > >
> >> > > > > > Hi Stephan,
> >> > > > > >
> >> > > > > > yes, this would work for the cases where an algorithm only
> >> updates
> >> > > the
> >> > > > > > vertex values or only updates the edge values.
> >> > > > > >
> >> > > > > > What we would like to also support is
> >> > > > > > (a) algorithms where both vertices and edges are updated in
> one
> >> > > > iteration
> >> > > > > > (b) algorithms where the graph structure changes from one
> >> iteration
> >> > > to
> >> > > > > the
> >> > > > > > next and
> >> > > > > > (c) branching inside an iteration, i.e. executing a different
> >> > > > "iteration
> >> > > > > > body" based on some condition.
> >> > > > > >
> >> > > > > > We can still currently implement those with regular Flink
> >> iteration
> >> > > > > > operators, but the resulting code is not that nice or
> efficient.
> >> > > > > > For example, if we want to update both edges and vertex
> values,
> >> we
> >> > > can
> >> > > > > > still create a solution set where the vertex values are
> >> attached to
> >> > > > each
> >> > > > > > edge.
> >> > > > > > Regarding branching inside an iteration, we can use an
> >> aggregator
> >> > > that
> >> > > > > > tracks the iteration phase, but then we need to somehow make
> the
> >> > > > > different
> >> > > > > > phases to consist of the same operators and also check the
> >> > branching
> >> > > > > > condition inside each UDF.
> >> > > > > >
> >> > > > > > Cheers,
> >> > > > > > V.
> >> > > > > >
> >> > > > > >
> >> > > > > > On 23 February 2015 at 11:05, Stephan Ewen <se...@apache.org>
> >> > wrote:
> >> > > > > >
> >> > > > > > > As a workaround, it should always work to get the Edge and
> >> Vertex
> >> > > > data
> >> > > > > > set
> >> > > > > > > from the graph and use the regular Fink iteration operators?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
> >> > > > > > > vasilikikalavri@gmail.com
> >> > > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hi,
> >> > > > > > > >
> >> > > > > > > > yes, I was referring to the parallel Boruvka algorithm.
> >> There
> >> > are
> >> > > > > > several
> >> > > > > > > > ways to implement this one in Flink and I believe that the
> >> one
> >> > > > > > described
> >> > > > > > > in
> >> > > > > > > > the paper (vertex-centric) is not the most elegant one :)
> >> > > > > > > >
> >> > > > > > > > Andra is now working on an idea that uses the delta
> >> iteration
> >> > > > > > abstraction
> >> > > > > > > > and we believe that it will be both more efficient and
> >> easier
> >> > to
> >> > > > > > > > understand. It has the edges in the solution set and the
> >> > vertices
> >> > > > in
> >> > > > > > the
> >> > > > > > > > workset, so it follows the pattern I describe in (2) in my
> >> > > previous
> >> > > > > > > e-mail.
> >> > > > > > > > As a next step, we would like to see how having an
> iteration
> >> > > > operator
> >> > > > > > > that
> >> > > > > > > > could update the whole graph -what I describe as (3)-
> would
> >> > make
> >> > > > this
> >> > > > > > > even
> >> > > > > > > > nicer.
> >> > > > > > > >
> >> > > > > > > > Any ideas are highly welcome!
> >> > > > > > > >
> >> > > > > > > > Cheers,
> >> > > > > > > > V.
> >> > > > > > > >
> >> > > > > > > > On 22 February 2015 at 16:32, Andra Lungu <
> >> > lungu.andra@gmail.com
> >> > > >
> >> > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Hi Alex,
> >> > > > > > > > >
> >> > > > > > > > > Vasia is talking about the second version(presented
> >> Friday)
> >> > of
> >> > > > > > Parallel
> >> > > > > > > > > Boruvka, which can be found here:
> >> > > > > > > > > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> >> > > > > > > > >
> >> > > > > > > > > I will propose the third, non-Pregel like approach
> >> directly
> >> > to
> >> > > > > Gelly
> >> > > > > > > > soon.
> >> > > > > > > > >
> >> > > > > > > > > If you have additional questions, I will be happy to
> >> answer
> >> > > them.
> >> > > > > > > > >
> >> > > > > > > > > Andra
> >> > > > > > > > >
> >> > > > > > > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
> >> > > > > > > > > alexander.s.alexandrov@gmail.com> wrote:
> >> > > > > > > > >
> >> > > > > > > > > > Hi Vasia,
> >> > > > > > > > > >
> >> > > > > > > > > > I am trying to look at the problem in more detail.
> Which
> >> > > > version
> >> > > > > of
> >> > > > > > > the
> >> > > > > > > > > MST
> >> > > > > > > > > > are you talking about?
> >> > > > > > > > > >
> >> > > > > > > > > > Right now in the Gelly repository I can only find the
> >> SSSP
> >> > > > > example
> >> > > > > > > > > > (parallel Bellman-Ford) from Section 4.2 in [1].
> >> > > > > > > > > >
> >> > > > > > > > > > However, it seems that the issues encountered by Andra
> >> are
> >> > > > > related
> >> > > > > > to
> >> > > > > > > > the
> >> > > > > > > > > > implementation of Parallel Boruvka (Section 3.2 in
> >> [2]). Is
> >> > > > that
> >> > > > > > > > correct?
> >> > > > > > > > > >
> >> > > > > > > > > > Regards,
> >> > > > > > > > > > A.
> >> > > > > > > > > >
> >> > > > > > > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> >> > > > > > > > > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> >> > > > > > > > > >
> >> > > > > > > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
> >> > > > > > > vasilikikalavri@gmail.com
> >> > > > > > > > >:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hello beautiful Flink people,
> >> > > > > > > > > > >
> >> > > > > > > > > > > during the past few days, Andra and I have been
> >> > discussing
> >> > > > > about
> >> > > > > > > how
> >> > > > > > > > to
> >> > > > > > > > > > > extend Gelly's iteration methods.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Alexander's course (and his awesome students) has
> >> made it
> >> > > > > obvious
> >> > > > > > > > that
> >> > > > > > > > > > > vertex-centric iterations are not the best fit for
> >> > > algorithms
> >> > > > > > which
> >> > > > > > > > > don't
> >> > > > > > > > > > > follow the common "propagate-update" pattern. For
> >> > example,
> >> > > > > Andra
> >> > > > > > is
> >> > > > > > > > > > working
> >> > > > > > > > > > > on an implementation of Minimum Spanning Tree, which
> >> > > requires
> >> > > > > > > > branching
> >> > > > > > > > > > > inside an iteration and also requires a convergence
> >> check
> >> > > of
> >> > > > an
> >> > > > > > > > > internal
> >> > > > > > > > > > > iteration. Others also reported similar issues [1,
> 2].
> >> > > Trying
> >> > > > > to
> >> > > > > > > fit
> >> > > > > > > > > such
> >> > > > > > > > > > > algorithms to the vertex-centric model leads to long
> >> and
> >> > > ugly
> >> > > > > > code,
> >> > > > > > > > > e.g.
> >> > > > > > > > > > > aggregators to keep track of algorithm phases,
> >> > duplicating
> >> > > > > data,
> >> > > > > > > etc.
> >> > > > > > > > > > >
> >> > > > > > > > > > > One limitation of the vertex-centric and the
> upcoming
> >> GAS
> >> > > > model
> >> > > > > > is
> >> > > > > > > > that
> >> > > > > > > > > > > they both only allow the vertex values to be updated
> >> in
> >> > > each
> >> > > > > > > > iteration.
> >> > > > > > > > > > > However, for some algorithms we need to update the
> >> edge
> >> > > > values
> >> > > > > > and
> >> > > > > > > in
> >> > > > > > > > > > > others we need to update both. In even more complex
> >> > > > situations
> >> > > > > > > (like
> >> > > > > > > > > > > Andra's MST) in some iterations we need to update
> the
> >> > > vertex
> >> > > > > > values
> >> > > > > > > > and
> >> > > > > > > > > > in
> >> > > > > > > > > > > some iterations we need to update the edge values.
> >> > > > > > > > > > > Another problem is that we currently don't have a
> way
> >> to
> >> > > > allow
> >> > > > > > > > > different
> >> > > > > > > > > > > computational phases inside an iteration. This is
> >> > something
> >> > > > > that
> >> > > > > > > > Giraph
> >> > > > > > > > > > > solves with master compute, a function that is
> >> executed
> >> > > once
> >> > > > > > before
> >> > > > > > > > > each
> >> > > > > > > > > > > superstep and sets the computation function.
> >> > > > > > > > > > >
> >> > > > > > > > > > > All that said, I believe that we can solve most of
> >> these
> >> > > > issues
> >> > > > > > if
> >> > > > > > > we
> >> > > > > > > > > > > nicely expose Flink's iteration operators in Gelly.
> I
> >> can
> >> > > see
> >> > > > > the
> >> > > > > > > > > > following
> >> > > > > > > > > > > cases:
> >> > > > > > > > > > >
> >> > > > > > > > > > > 1. Bulk & delta iterations where the solution set is
> >> the
> >> > > > vertex
> >> > > > > > > > > dataset:
> >> > > > > > > > > > > this will be similar to vertex-centric and GAS, but
> >> will
> >> > > > allow
> >> > > > > > more
> >> > > > > > > > > > > flexible dataflows inside the iteration.
> >> > > > > > > > > > > 2. Bulk & delta iterations where the solution set is
> >> the
> >> > > edge
> >> > > > > > > > dataset:
> >> > > > > > > > > > for
> >> > > > > > > > > > > the cases where we need to update edge values.
> >> > > > > > > > > > > 3. Bulk & delta iterations where the solution set is
> >> the
> >> > > > Graph:
> >> > > > > > > this
> >> > > > > > > > > will
> >> > > > > > > > > > > cover more complex cases, where the algorithm
> updates
> >> > both
> >> > > > > > vertices
> >> > > > > > > > and
> >> > > > > > > > > > > edges or even adds/removes vertices/edges, i.e.
> >> updates
> >> > the
> >> > > > > whole
> >> > > > > > > > > Graph.
> >> > > > > > > > > > >
> >> > > > > > > > > > > What do you think? I can see 1 & 2 being very easy
> to
> >> > > > > implement,
> >> > > > > > > but
> >> > > > > > > > I
> >> > > > > > > > > > > suspect 3 won't be that easy (but so awesome to have
> >> ^^).
> >> > > > > > > > > > > Would it work the way a Graph is represented now,
> i.e.
> >> > > with 2
> >> > > > > > > > DataSets?
> >> > > > > > > > > > >
> >> > > > > > > > > > > Any comment, idea, pointer would be much
> appreciated!
> >> > Thank
> >> > > > you
> >> > > > > > ^^
> >> > > > > > > > > > >
> >> > > > > > > > > > > Cheers,
> >> > > > > > > > > > > -V.
> >> > > > > > > > > > >
> >> > > > > > > > > > > [1]:
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> >> > > > > > > > > > > [2]:
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Vasiliki Kalavri <va...@gmail.com>.

Hi,

I'm back looking into this and I tried out the for-loop approach that was
suggested above.

I implemented a simple algorithm, k-core, which computes the k-core of a
graph by iteratively filtering out vertices with degree less than k.
You can find the code in [1].

Unfortunately, this is giving me the following error, with an example graph
of 10 vertices:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
exceeded
at
org.apache.flink.optimizer.plan.PlanNode.mergeBranchPlanMaps(PlanNode.java:106)
at
org.apache.flink.optimizer.plan.PlanNode.mergeBranchPlanMaps(PlanNode.java:99)
at
org.apache.flink.optimizer.plan.DualInputPlanNode.<init>(DualInputPlanNode.java:90)
at
org.apache.flink.optimizer.plan.DualInputPlanNode.<init>(DualInputPlanNode.java:69)
at
org.apache.flink.optimizer.operators.HashJoinBuildSecondProperties.instantiate(HashJoinBuildSecondProperties.java:81)
at
org.apache.flink.optimizer.dag.TwoInputNode.instantiate(TwoInputNode.java:607)
at
org.apache.flink.optimizer.dag.TwoInputNode.addLocalCandidates(TwoInputNode.java:557)
at
org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:478)
at
org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:309)
at
org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:309)
at
org.apache.flink.optimizer.dag.SingleInputNode.getAlternativePlans(SingleInputNode.java:249)
at
org.apache.flink.optimizer.dag.SingleInputNode.getAlternativePlans(SingleInputNode.java:249)
at
org.apache.flink.optimizer.dag.DataSinkNode.getAlternativePlans(DataSinkNode.java:204)
at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:501)
at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:403)
at org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:162)
at
org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:51)
at
org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:797)
at org.apache.flink.api.java.DataSet.count(DataSet.java:397)
at org.apache.flink.graph.Graph.numberOfVertices(Graph.java:913)
at org.apache.flink.graph.example.KCoreExample.main(KCoreExample.java:91)

Basically, the first 3 iterations are executed normally, then the 4th gets
stuck and eventually the program fails with the above error message.
Any help / suggestion would be greatly appreciated ^^

Cheers,
-Vasia.

[1]:
https://github.com/vasia/flink/commit/62b065b69746e25b74bee84d210d5c5b4ee761bd

On 23 February 2015 at 18:05, Vasiliki Kalavri <va...@gmail.com>
wrote:

> I see, thanks a lot for the answers!
> To rephrase my original question, would it make sense to define a
> closed-loop iteration where the state is the whole graph?
>
> If you want to take a look at the current implementation of DMST using
> delta iteration, Andra has made a PR [1].
> On a high-level, this algorithm does (more or less) the following:
>
> while there are more than one vertices in the graph
>    for every vertex
>       select the min-weight edge
>       add all selected edges to the MST
>    collapse vertices with the same root into one vertex (iterative
> connected components-like step)
>
> What we are thinking is that we could express this with -maybe- a
> bulk-like iteration on the Graph, inside which we can use Gelly methods.
> Would it make sense or shall we go for a for-loop implementation instead?
>
> Thanks!
> -V.
>
> [1]: http://github.com/apache/flink/pull/434
>
> On 23 February 2015 at 16:18, Stephan Ewen <se...@apache.org> wrote:
>
>> Closed-loop iterations are much more efficient right now. Long for loops
>> suffer from memory fragmentation (an issue that is in the list to fix).
>>
>> Also, closed loops can be stateful (delta iterations) and do not require
>> task re-deployment.
>>
>> On Mon, Feb 23, 2015 at 4:15 PM, Vasiliki Kalavri <
>> vasilikikalavri@gmail.com
>> > wrote:
>>
>> > I see that's cool :-)
>> > So, what is the advantage of closed-loop versus for-loop iterations?
>> > Custom convergence criteria / aggregators and more efficient execution
>> > plans?
>> >
>> > On 23 February 2015 at 15:01, Stephan Ewen <se...@apache.org> wrote:
>> >
>> > > For loops are basically rolled out - they yield long execution plans.
>> > >
>> > > On Mon, Feb 23, 2015 at 2:44 PM, Vasiliki Kalavri <
>> > > vasilikikalavri@gmail.com
>> > > > wrote:
>> > >
>> > > > for-loop iterations could cover some cases, I guess, when the
>> number of
>> > > > iterations is known beforehand.
>> > > > Are there currently any restrictions on what can be used inside a
>> > > for-loop?
>> > > > How are they translated into execution plans?
>> > > >
>> > > > On 23 February 2015 at 13:08, Stephan Ewen <se...@apache.org>
>> wrote:
>> > > >
>> > > > > Some things may not work well as "closed-loop" iterations.
>> > > > >
>> > > > > Is it possible to express those as for-loop iterations?
>> > > > >
>> > > > > On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <
>> > > > > vasilikikalavri@gmail.com
>> > > > > > wrote:
>> > > > >
>> > > > > > Hi Stephan,
>> > > > > >
>> > > > > > yes, this would work for the cases where an algorithm only
>> updates
>> > > the
>> > > > > > vertex values or only updates the edge values.
>> > > > > >
>> > > > > > What we would like to also support is
>> > > > > > (a) algorithms where both vertices and edges are updated in one
>> > > > iteration
>> > > > > > (b) algorithms where the graph structure changes from one
>> iteration
>> > > to
>> > > > > the
>> > > > > > next and
>> > > > > > (c) branching inside an iteration, i.e. executing a different
>> > > > "iteration
>> > > > > > body" based on some condition.
>> > > > > >
>> > > > > > We can still currently implement those with regular Flink
>> iteration
>> > > > > > operators, but the resulting code is not that nice or efficient.
>> > > > > > For example, if we want to update both edges and vertex values,
>> we
>> > > can
>> > > > > > still create a solution set where the vertex values are
>> attached to
>> > > > each
>> > > > > > edge.
>> > > > > > Regarding branching inside an iteration, we can use an
>> aggregator
>> > > that
>> > > > > > tracks the iteration phase, but then we need to somehow make the
>> > > > > different
>> > > > > > phases to consist of the same operators and also check the
>> > branching
>> > > > > > condition inside each UDF.
>> > > > > >
>> > > > > > Cheers,
>> > > > > > V.
>> > > > > >
>> > > > > >
>> > > > > > On 23 February 2015 at 11:05, Stephan Ewen <se...@apache.org>
>> > wrote:
>> > > > > >
>> > > > > > > As a workaround, it should always work to get the Edge and
>> Vertex
>> > > > data
>> > > > > > set
>> > > > > > > from the graph and use the regular Fink iteration operators?
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
>> > > > > > > vasilikikalavri@gmail.com
>> > > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi,
>> > > > > > > >
>> > > > > > > > yes, I was referring to the parallel Boruvka algorithm.
>> There
>> > are
>> > > > > > several
>> > > > > > > > ways to implement this one in Flink and I believe that the
>> one
>> > > > > > described
>> > > > > > > in
>> > > > > > > > the paper (vertex-centric) is not the most elegant one :)
>> > > > > > > >
>> > > > > > > > Andra is now working on an idea that uses the delta
>> iteration
>> > > > > > abstraction
>> > > > > > > > and we believe that it will be both more efficient and
>> easier
>> > to
>> > > > > > > > understand. It has the edges in the solution set and the
>> > vertices
>> > > > in
>> > > > > > the
>> > > > > > > > workset, so it follows the pattern I describe in (2) in my
>> > > previous
>> > > > > > > e-mail.
>> > > > > > > > As a next step, we would like to see how having an iteration
>> > > > operator
>> > > > > > > that
>> > > > > > > > could update the whole graph -what I describe as (3)- would
>> > make
>> > > > this
>> > > > > > > even
>> > > > > > > > nicer.
>> > > > > > > >
>> > > > > > > > Any ideas are highly welcome!
>> > > > > > > >
>> > > > > > > > Cheers,
>> > > > > > > > V.
>> > > > > > > >
>> > > > > > > > On 22 February 2015 at 16:32, Andra Lungu <
>> > lungu.andra@gmail.com
>> > > >
>> > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi Alex,
>> > > > > > > > >
>> > > > > > > > > Vasia is talking about the second version(presented
>> Friday)
>> > of
>> > > > > > Parallel
>> > > > > > > > > Boruvka, which can be found here:
>> > > > > > > > > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
>> > > > > > > > >
>> > > > > > > > > I will propose the third, non-Pregel like approach
>> directly
>> > to
>> > > > > Gelly
>> > > > > > > > soon.
>> > > > > > > > >
>> > > > > > > > > If you have additional questions, I will be happy to
>> answer
>> > > them.
>> > > > > > > > >
>> > > > > > > > > Andra
>> > > > > > > > >
>> > > > > > > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
>> > > > > > > > > alexander.s.alexandrov@gmail.com> wrote:
>> > > > > > > > >
>> > > > > > > > > > Hi Vasia,
>> > > > > > > > > >
>> > > > > > > > > > I am trying to look at the problem in more detail. Which
>> > > > version
>> > > > > of
>> > > > > > > the
>> > > > > > > > > MST
>> > > > > > > > > > are you talking about?
>> > > > > > > > > >
>> > > > > > > > > > Right now in the Gelly repository I can only find the
>> SSSP
>> > > > > example
>> > > > > > > > > > (parallel Bellman-Ford) from Section 4.2 in [1].
>> > > > > > > > > >
>> > > > > > > > > > However, it seems that the issues encountered by Andra
>> are
>> > > > > related
>> > > > > > to
>> > > > > > > > the
>> > > > > > > > > > implementation of Parallel Boruvka (Section 3.2 in
>> [2]). Is
>> > > > that
>> > > > > > > > correct?
>> > > > > > > > > >
>> > > > > > > > > > Regards,
>> > > > > > > > > > A.
>> > > > > > > > > >
>> > > > > > > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
>> > > > > > > > > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
>> > > > > > > > > >
>> > > > > > > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
>> > > > > > > vasilikikalavri@gmail.com
>> > > > > > > > >:
>> > > > > > > > > >
>> > > > > > > > > > > Hello beautiful Flink people,
>> > > > > > > > > > >
>> > > > > > > > > > > during the past few days, Andra and I have been
>> > discussing
>> > > > > about
>> > > > > > > how
>> > > > > > > > to
>> > > > > > > > > > > extend Gelly's iteration methods.
>> > > > > > > > > > >
>> > > > > > > > > > > Alexander's course (and his awesome students) has
>> made it
>> > > > > obvious
>> > > > > > > > that
>> > > > > > > > > > > vertex-centric iterations are not the best fit for
>> > > algorithms
>> > > > > > which
>> > > > > > > > > don't
>> > > > > > > > > > > follow the common "propagate-update" pattern. For
>> > example,
>> > > > > Andra
>> > > > > > is
>> > > > > > > > > > working
>> > > > > > > > > > > on an implementation of Minimum Spanning Tree, which
>> > > requires
>> > > > > > > > branching
>> > > > > > > > > > > inside an iteration and also requires a convergence
>> check
>> > > of
>> > > > an
>> > > > > > > > > internal
>> > > > > > > > > > > iteration. Others also reported similar issues [1, 2].
>> > > Trying
>> > > > > to
>> > > > > > > fit
>> > > > > > > > > such
>> > > > > > > > > > > algorithms to the vertex-centric model leads to long
>> and
>> > > ugly
>> > > > > > code,
>> > > > > > > > > e.g.
>> > > > > > > > > > > aggregators to keep track of algorithm phases,
>> > duplicating
>> > > > > data,
>> > > > > > > etc.
>> > > > > > > > > > >
>> > > > > > > > > > > One limitation of the vertex-centric and the upcoming
>> GAS
>> > > > model
>> > > > > > is
>> > > > > > > > that
>> > > > > > > > > > > they both only allow the vertex values to be updated
>> in
>> > > each
>> > > > > > > > iteration.
>> > > > > > > > > > > However, for some algorithms we need to update the
>> edge
>> > > > values
>> > > > > > and
>> > > > > > > in
>> > > > > > > > > > > others we need to update both. In even more complex
>> > > > situations
>> > > > > > > (like
>> > > > > > > > > > > Andra's MST) in some iterations we need to update the
>> > > vertex
>> > > > > > values
>> > > > > > > > and
>> > > > > > > > > > in
>> > > > > > > > > > > some iterations we need to update the edge values.
>> > > > > > > > > > > Another problem is that we currently don't have a way
>> to
>> > > > allow
>> > > > > > > > > different
>> > > > > > > > > > > computational phases inside an iteration. This is
>> > something
>> > > > > that
>> > > > > > > > Giraph
>> > > > > > > > > > > solves with master compute, a function that is
>> executed
>> > > once
>> > > > > > before
>> > > > > > > > > each
>> > > > > > > > > > > superstep and sets the computation function.
>> > > > > > > > > > >
>> > > > > > > > > > > All that said, I believe that we can solve most of
>> these
>> > > > issues
>> > > > > > if
>> > > > > > > we
>> > > > > > > > > > > nicely expose Flink's iteration operators in Gelly. I
>> can
>> > > see
>> > > > > the
>> > > > > > > > > > following
>> > > > > > > > > > > cases:
>> > > > > > > > > > >
>> > > > > > > > > > > 1. Bulk & delta iterations where the solution set is
>> the
>> > > > vertex
>> > > > > > > > > dataset:
>> > > > > > > > > > > this will be similar to vertex-centric and GAS, but
>> will
>> > > > allow
>> > > > > > more
>> > > > > > > > > > > flexible dataflows inside the iteration.
>> > > > > > > > > > > 2. Bulk & delta iterations where the solution set is
>> the
>> > > edge
>> > > > > > > > dataset:
>> > > > > > > > > > for
>> > > > > > > > > > > the cases where we need to update edge values.
>> > > > > > > > > > > 3. Bulk & delta iterations where the solution set is
>> the
>> > > > Graph:
>> > > > > > > this
>> > > > > > > > > will
>> > > > > > > > > > > cover more complex cases, where the algorithm updates
>> > both
>> > > > > > vertices
>> > > > > > > > and
>> > > > > > > > > > > edges or even adds/removes vertices/edges, i.e.
>> updates
>> > the
>> > > > > whole
>> > > > > > > > > Graph.
>> > > > > > > > > > >
>> > > > > > > > > > > What do you think? I can see 1 & 2 being very easy to
>> > > > > implement,
>> > > > > > > but
>> > > > > > > > I
>> > > > > > > > > > > suspect 3 won't be that easy (but so awesome to have
>> ^^).
>> > > > > > > > > > > Would it work the way a Graph is represented now, i.e.
>> > > with 2
>> > > > > > > > DataSets?
>> > > > > > > > > > >
>> > > > > > > > > > > Any comment, idea, pointer would be much appreciated!
>> > Thank
>> > > > you
>> > > > > > ^^
>> > > > > > > > > > >
>> > > > > > > > > > > Cheers,
>> > > > > > > > > > > -V.
>> > > > > > > > > > >
>> > > > > > > > > > > [1]:
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
>> > > > > > > > > > > [2]:
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Vasiliki Kalavri <va...@gmail.com>.

I see, thanks a lot for the answers!
To rephrase my original question, would it make sense to define a
closed-loop iteration where the state is the whole graph?

If you want to take a look at the current implementation of DMST using
delta iteration, Andra has made a PR [1].
On a high-level, this algorithm does (more or less) the following:

while there are more than one vertices in the graph
   for every vertex
      select the min-weight edge
      add all selected edges to the MST
   collapse vertices with the same root into one vertex (iterative
connected components-like step)

What we are thinking is that we could express this with -maybe- a bulk-like
iteration on the Graph, inside which we can use Gelly methods.
Would it make sense or shall we go for a for-loop implementation instead?

Thanks!
-V.

[1]: http://github.com/apache/flink/pull/434

On 23 February 2015 at 16:18, Stephan Ewen <se...@apache.org> wrote:

> Closed-loop iterations are much more efficient right now. Long for loops
> suffer from memory fragmentation (an issue that is in the list to fix).
>
> Also, closed loops can be stateful (delta iterations) and do not require
> task re-deployment.
>
> On Mon, Feb 23, 2015 at 4:15 PM, Vasiliki Kalavri <
> vasilikikalavri@gmail.com
> > wrote:
>
> > I see that's cool :-)
> > So, what is the advantage of closed-loop versus for-loop iterations?
> > Custom convergence criteria / aggregators and more efficient execution
> > plans?
> >
> > On 23 February 2015 at 15:01, Stephan Ewen <se...@apache.org> wrote:
> >
> > > For loops are basically rolled out - they yield long execution plans.
> > >
> > > On Mon, Feb 23, 2015 at 2:44 PM, Vasiliki Kalavri <
> > > vasilikikalavri@gmail.com
> > > > wrote:
> > >
> > > > for-loop iterations could cover some cases, I guess, when the number
> of
> > > > iterations is known beforehand.
> > > > Are there currently any restrictions on what can be used inside a
> > > for-loop?
> > > > How are they translated into execution plans?
> > > >
> > > > On 23 February 2015 at 13:08, Stephan Ewen <se...@apache.org> wrote:
> > > >
> > > > > Some things may not work well as "closed-loop" iterations.
> > > > >
> > > > > Is it possible to express those as for-loop iterations?
> > > > >
> > > > > On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <
> > > > > vasilikikalavri@gmail.com
> > > > > > wrote:
> > > > >
> > > > > > Hi Stephan,
> > > > > >
> > > > > > yes, this would work for the cases where an algorithm only
> updates
> > > the
> > > > > > vertex values or only updates the edge values.
> > > > > >
> > > > > > What we would like to also support is
> > > > > > (a) algorithms where both vertices and edges are updated in one
> > > > iteration
> > > > > > (b) algorithms where the graph structure changes from one
> iteration
> > > to
> > > > > the
> > > > > > next and
> > > > > > (c) branching inside an iteration, i.e. executing a different
> > > > "iteration
> > > > > > body" based on some condition.
> > > > > >
> > > > > > We can still currently implement those with regular Flink
> iteration
> > > > > > operators, but the resulting code is not that nice or efficient.
> > > > > > For example, if we want to update both edges and vertex values,
> we
> > > can
> > > > > > still create a solution set where the vertex values are attached
> to
> > > > each
> > > > > > edge.
> > > > > > Regarding branching inside an iteration, we can use an aggregator
> > > that
> > > > > > tracks the iteration phase, but then we need to somehow make the
> > > > > different
> > > > > > phases to consist of the same operators and also check the
> > branching
> > > > > > condition inside each UDF.
> > > > > >
> > > > > > Cheers,
> > > > > > V.
> > > > > >
> > > > > >
> > > > > > On 23 February 2015 at 11:05, Stephan Ewen <se...@apache.org>
> > wrote:
> > > > > >
> > > > > > > As a workaround, it should always work to get the Edge and
> Vertex
> > > > data
> > > > > > set
> > > > > > > from the graph and use the regular Fink iteration operators?
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
> > > > > > > vasilikikalavri@gmail.com
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > yes, I was referring to the parallel Boruvka algorithm. There
> > are
> > > > > > several
> > > > > > > > ways to implement this one in Flink and I believe that the
> one
> > > > > > described
> > > > > > > in
> > > > > > > > the paper (vertex-centric) is not the most elegant one :)
> > > > > > > >
> > > > > > > > Andra is now working on an idea that uses the delta iteration
> > > > > > abstraction
> > > > > > > > and we believe that it will be both more efficient and easier
> > to
> > > > > > > > understand. It has the edges in the solution set and the
> > vertices
> > > > in
> > > > > > the
> > > > > > > > workset, so it follows the pattern I describe in (2) in my
> > > previous
> > > > > > > e-mail.
> > > > > > > > As a next step, we would like to see how having an iteration
> > > > operator
> > > > > > > that
> > > > > > > > could update the whole graph -what I describe as (3)- would
> > make
> > > > this
> > > > > > > even
> > > > > > > > nicer.
> > > > > > > >
> > > > > > > > Any ideas are highly welcome!
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > V.
> > > > > > > >
> > > > > > > > On 22 February 2015 at 16:32, Andra Lungu <
> > lungu.andra@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Alex,
> > > > > > > > >
> > > > > > > > > Vasia is talking about the second version(presented Friday)
> > of
> > > > > > Parallel
> > > > > > > > > Boruvka, which can be found here:
> > > > > > > > > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> > > > > > > > >
> > > > > > > > > I will propose the third, non-Pregel like approach directly
> > to
> > > > > Gelly
> > > > > > > > soon.
> > > > > > > > >
> > > > > > > > > If you have additional questions, I will be happy to answer
> > > them.
> > > > > > > > >
> > > > > > > > > Andra
> > > > > > > > >
> > > > > > > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
> > > > > > > > > alexander.s.alexandrov@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Vasia,
> > > > > > > > > >
> > > > > > > > > > I am trying to look at the problem in more detail. Which
> > > > version
> > > > > of
> > > > > > > the
> > > > > > > > > MST
> > > > > > > > > > are you talking about?
> > > > > > > > > >
> > > > > > > > > > Right now in the Gelly repository I can only find the
> SSSP
> > > > > example
> > > > > > > > > > (parallel Bellman-Ford) from Section 4.2 in [1].
> > > > > > > > > >
> > > > > > > > > > However, it seems that the issues encountered by Andra
> are
> > > > > related
> > > > > > to
> > > > > > > > the
> > > > > > > > > > implementation of Parallel Boruvka (Section 3.2 in [2]).
> Is
> > > > that
> > > > > > > > correct?
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > A.
> > > > > > > > > >
> > > > > > > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> > > > > > > > > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> > > > > > > > > >
> > > > > > > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
> > > > > > > vasilikikalavri@gmail.com
> > > > > > > > >:
> > > > > > > > > >
> > > > > > > > > > > Hello beautiful Flink people,
> > > > > > > > > > >
> > > > > > > > > > > during the past few days, Andra and I have been
> > discussing
> > > > > about
> > > > > > > how
> > > > > > > > to
> > > > > > > > > > > extend Gelly's iteration methods.
> > > > > > > > > > >
> > > > > > > > > > > Alexander's course (and his awesome students) has made
> it
> > > > > obvious
> > > > > > > > that
> > > > > > > > > > > vertex-centric iterations are not the best fit for
> > > algorithms
> > > > > > which
> > > > > > > > > don't
> > > > > > > > > > > follow the common "propagate-update" pattern. For
> > example,
> > > > > Andra
> > > > > > is
> > > > > > > > > > working
> > > > > > > > > > > on an implementation of Minimum Spanning Tree, which
> > > requires
> > > > > > > > branching
> > > > > > > > > > > inside an iteration and also requires a convergence
> check
> > > of
> > > > an
> > > > > > > > > internal
> > > > > > > > > > > iteration. Others also reported similar issues [1, 2].
> > > Trying
> > > > > to
> > > > > > > fit
> > > > > > > > > such
> > > > > > > > > > > algorithms to the vertex-centric model leads to long
> and
> > > ugly
> > > > > > code,
> > > > > > > > > e.g.
> > > > > > > > > > > aggregators to keep track of algorithm phases,
> > duplicating
> > > > > data,
> > > > > > > etc.
> > > > > > > > > > >
> > > > > > > > > > > One limitation of the vertex-centric and the upcoming
> GAS
> > > > model
> > > > > > is
> > > > > > > > that
> > > > > > > > > > > they both only allow the vertex values to be updated in
> > > each
> > > > > > > > iteration.
> > > > > > > > > > > However, for some algorithms we need to update the edge
> > > > values
> > > > > > and
> > > > > > > in
> > > > > > > > > > > others we need to update both. In even more complex
> > > > situations
> > > > > > > (like
> > > > > > > > > > > Andra's MST) in some iterations we need to update the
> > > vertex
> > > > > > values
> > > > > > > > and
> > > > > > > > > > in
> > > > > > > > > > > some iterations we need to update the edge values.
> > > > > > > > > > > Another problem is that we currently don't have a way
> to
> > > > allow
> > > > > > > > > different
> > > > > > > > > > > computational phases inside an iteration. This is
> > something
> > > > > that
> > > > > > > > Giraph
> > > > > > > > > > > solves with master compute, a function that is executed
> > > once
> > > > > > before
> > > > > > > > > each
> > > > > > > > > > > superstep and sets the computation function.
> > > > > > > > > > >
> > > > > > > > > > > All that said, I believe that we can solve most of
> these
> > > > issues
> > > > > > if
> > > > > > > we
> > > > > > > > > > > nicely expose Flink's iteration operators in Gelly. I
> can
> > > see
> > > > > the
> > > > > > > > > > following
> > > > > > > > > > > cases:
> > > > > > > > > > >
> > > > > > > > > > > 1. Bulk & delta iterations where the solution set is
> the
> > > > vertex
> > > > > > > > > dataset:
> > > > > > > > > > > this will be similar to vertex-centric and GAS, but
> will
> > > > allow
> > > > > > more
> > > > > > > > > > > flexible dataflows inside the iteration.
> > > > > > > > > > > 2. Bulk & delta iterations where the solution set is
> the
> > > edge
> > > > > > > > dataset:
> > > > > > > > > > for
> > > > > > > > > > > the cases where we need to update edge values.
> > > > > > > > > > > 3. Bulk & delta iterations where the solution set is
> the
> > > > Graph:
> > > > > > > this
> > > > > > > > > will
> > > > > > > > > > > cover more complex cases, where the algorithm updates
> > both
> > > > > > vertices
> > > > > > > > and
> > > > > > > > > > > edges or even adds/removes vertices/edges, i.e. updates
> > the
> > > > > whole
> > > > > > > > > Graph.
> > > > > > > > > > >
> > > > > > > > > > > What do you think? I can see 1 & 2 being very easy to
> > > > > implement,
> > > > > > > but
> > > > > > > > I
> > > > > > > > > > > suspect 3 won't be that easy (but so awesome to have
> ^^).
> > > > > > > > > > > Would it work the way a Graph is represented now, i.e.
> > > with 2
> > > > > > > > DataSets?
> > > > > > > > > > >
> > > > > > > > > > > Any comment, idea, pointer would be much appreciated!
> > Thank
> > > > you
> > > > > > ^^
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > -V.
> > > > > > > > > > >
> > > > > > > > > > > [1]:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > > > > > > > > > > [2]:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Stephan Ewen <se...@apache.org>.

Closed-loop iterations are much more efficient right now. Long for loops
suffer from memory fragmentation (an issue that is in the list to fix).

Also, closed loops can be stateful (delta iterations) and do not require
task re-deployment.

On Mon, Feb 23, 2015 at 4:15 PM, Vasiliki Kalavri <vasilikikalavri@gmail.com
> wrote:

> I see that's cool :-)
> So, what is the advantage of closed-loop versus for-loop iterations?
> Custom convergence criteria / aggregators and more efficient execution
> plans?
>
> On 23 February 2015 at 15:01, Stephan Ewen <se...@apache.org> wrote:
>
> > For loops are basically rolled out - they yield long execution plans.
> >
> > On Mon, Feb 23, 2015 at 2:44 PM, Vasiliki Kalavri <
> > vasilikikalavri@gmail.com
> > > wrote:
> >
> > > for-loop iterations could cover some cases, I guess, when the number of
> > > iterations is known beforehand.
> > > Are there currently any restrictions on what can be used inside a
> > for-loop?
> > > How are they translated into execution plans?
> > >
> > > On 23 February 2015 at 13:08, Stephan Ewen <se...@apache.org> wrote:
> > >
> > > > Some things may not work well as "closed-loop" iterations.
> > > >
> > > > Is it possible to express those as for-loop iterations?
> > > >
> > > > On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <
> > > > vasilikikalavri@gmail.com
> > > > > wrote:
> > > >
> > > > > Hi Stephan,
> > > > >
> > > > > yes, this would work for the cases where an algorithm only updates
> > the
> > > > > vertex values or only updates the edge values.
> > > > >
> > > > > What we would like to also support is
> > > > > (a) algorithms where both vertices and edges are updated in one
> > > iteration
> > > > > (b) algorithms where the graph structure changes from one iteration
> > to
> > > > the
> > > > > next and
> > > > > (c) branching inside an iteration, i.e. executing a different
> > > "iteration
> > > > > body" based on some condition.
> > > > >
> > > > > We can still currently implement those with regular Flink iteration
> > > > > operators, but the resulting code is not that nice or efficient.
> > > > > For example, if we want to update both edges and vertex values, we
> > can
> > > > > still create a solution set where the vertex values are attached to
> > > each
> > > > > edge.
> > > > > Regarding branching inside an iteration, we can use an aggregator
> > that
> > > > > tracks the iteration phase, but then we need to somehow make the
> > > > different
> > > > > phases to consist of the same operators and also check the
> branching
> > > > > condition inside each UDF.
> > > > >
> > > > > Cheers,
> > > > > V.
> > > > >
> > > > >
> > > > > On 23 February 2015 at 11:05, Stephan Ewen <se...@apache.org>
> wrote:
> > > > >
> > > > > > As a workaround, it should always work to get the Edge and Vertex
> > > data
> > > > > set
> > > > > > from the graph and use the regular Fink iteration operators?
> > > > > >
> > > > > >
> > > > > > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
> > > > > > vasilikikalavri@gmail.com
> > > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > yes, I was referring to the parallel Boruvka algorithm. There
> are
> > > > > several
> > > > > > > ways to implement this one in Flink and I believe that the one
> > > > > described
> > > > > > in
> > > > > > > the paper (vertex-centric) is not the most elegant one :)
> > > > > > >
> > > > > > > Andra is now working on an idea that uses the delta iteration
> > > > > abstraction
> > > > > > > and we believe that it will be both more efficient and easier
> to
> > > > > > > understand. It has the edges in the solution set and the
> vertices
> > > in
> > > > > the
> > > > > > > workset, so it follows the pattern I describe in (2) in my
> > previous
> > > > > > e-mail.
> > > > > > > As a next step, we would like to see how having an iteration
> > > operator
> > > > > > that
> > > > > > > could update the whole graph -what I describe as (3)- would
> make
> > > this
> > > > > > even
> > > > > > > nicer.
> > > > > > >
> > > > > > > Any ideas are highly welcome!
> > > > > > >
> > > > > > > Cheers,
> > > > > > > V.
> > > > > > >
> > > > > > > On 22 February 2015 at 16:32, Andra Lungu <
> lungu.andra@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Alex,
> > > > > > > >
> > > > > > > > Vasia is talking about the second version(presented Friday)
> of
> > > > > Parallel
> > > > > > > > Boruvka, which can be found here:
> > > > > > > > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> > > > > > > >
> > > > > > > > I will propose the third, non-Pregel like approach directly
> to
> > > > Gelly
> > > > > > > soon.
> > > > > > > >
> > > > > > > > If you have additional questions, I will be happy to answer
> > them.
> > > > > > > >
> > > > > > > > Andra
> > > > > > > >
> > > > > > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
> > > > > > > > alexander.s.alexandrov@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Vasia,
> > > > > > > > >
> > > > > > > > > I am trying to look at the problem in more detail. Which
> > > version
> > > > of
> > > > > > the
> > > > > > > > MST
> > > > > > > > > are you talking about?
> > > > > > > > >
> > > > > > > > > Right now in the Gelly repository I can only find the SSSP
> > > > example
> > > > > > > > > (parallel Bellman-Ford) from Section 4.2 in [1].
> > > > > > > > >
> > > > > > > > > However, it seems that the issues encountered by Andra are
> > > > related
> > > > > to
> > > > > > > the
> > > > > > > > > implementation of Parallel Boruvka (Section 3.2 in [2]). Is
> > > that
> > > > > > > correct?
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > A.
> > > > > > > > >
> > > > > > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> > > > > > > > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> > > > > > > > >
> > > > > > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
> > > > > > vasilikikalavri@gmail.com
> > > > > > > >:
> > > > > > > > >
> > > > > > > > > > Hello beautiful Flink people,
> > > > > > > > > >
> > > > > > > > > > during the past few days, Andra and I have been
> discussing
> > > > about
> > > > > > how
> > > > > > > to
> > > > > > > > > > extend Gelly's iteration methods.
> > > > > > > > > >
> > > > > > > > > > Alexander's course (and his awesome students) has made it
> > > > obvious
> > > > > > > that
> > > > > > > > > > vertex-centric iterations are not the best fit for
> > algorithms
> > > > > which
> > > > > > > > don't
> > > > > > > > > > follow the common "propagate-update" pattern. For
> example,
> > > > Andra
> > > > > is
> > > > > > > > > working
> > > > > > > > > > on an implementation of Minimum Spanning Tree, which
> > requires
> > > > > > > branching
> > > > > > > > > > inside an iteration and also requires a convergence check
> > of
> > > an
> > > > > > > > internal
> > > > > > > > > > iteration. Others also reported similar issues [1, 2].
> > Trying
> > > > to
> > > > > > fit
> > > > > > > > such
> > > > > > > > > > algorithms to the vertex-centric model leads to long and
> > ugly
> > > > > code,
> > > > > > > > e.g.
> > > > > > > > > > aggregators to keep track of algorithm phases,
> duplicating
> > > > data,
> > > > > > etc.
> > > > > > > > > >
> > > > > > > > > > One limitation of the vertex-centric and the upcoming GAS
> > > model
> > > > > is
> > > > > > > that
> > > > > > > > > > they both only allow the vertex values to be updated in
> > each
> > > > > > > iteration.
> > > > > > > > > > However, for some algorithms we need to update the edge
> > > values
> > > > > and
> > > > > > in
> > > > > > > > > > others we need to update both. In even more complex
> > > situations
> > > > > > (like
> > > > > > > > > > Andra's MST) in some iterations we need to update the
> > vertex
> > > > > values
> > > > > > > and
> > > > > > > > > in
> > > > > > > > > > some iterations we need to update the edge values.
> > > > > > > > > > Another problem is that we currently don't have a way to
> > > allow
> > > > > > > > different
> > > > > > > > > > computational phases inside an iteration. This is
> something
> > > > that
> > > > > > > Giraph
> > > > > > > > > > solves with master compute, a function that is executed
> > once
> > > > > before
> > > > > > > > each
> > > > > > > > > > superstep and sets the computation function.
> > > > > > > > > >
> > > > > > > > > > All that said, I believe that we can solve most of these
> > > issues
> > > > > if
> > > > > > we
> > > > > > > > > > nicely expose Flink's iteration operators in Gelly. I can
> > see
> > > > the
> > > > > > > > > following
> > > > > > > > > > cases:
> > > > > > > > > >
> > > > > > > > > > 1. Bulk & delta iterations where the solution set is the
> > > vertex
> > > > > > > > dataset:
> > > > > > > > > > this will be similar to vertex-centric and GAS, but will
> > > allow
> > > > > more
> > > > > > > > > > flexible dataflows inside the iteration.
> > > > > > > > > > 2. Bulk & delta iterations where the solution set is the
> > edge
> > > > > > > dataset:
> > > > > > > > > for
> > > > > > > > > > the cases where we need to update edge values.
> > > > > > > > > > 3. Bulk & delta iterations where the solution set is the
> > > Graph:
> > > > > > this
> > > > > > > > will
> > > > > > > > > > cover more complex cases, where the algorithm updates
> both
> > > > > vertices
> > > > > > > and
> > > > > > > > > > edges or even adds/removes vertices/edges, i.e. updates
> the
> > > > whole
> > > > > > > > Graph.
> > > > > > > > > >
> > > > > > > > > > What do you think? I can see 1 & 2 being very easy to
> > > > implement,
> > > > > > but
> > > > > > > I
> > > > > > > > > > suspect 3 won't be that easy (but so awesome to have ^^).
> > > > > > > > > > Would it work the way a Graph is represented now, i.e.
> > with 2
> > > > > > > DataSets?
> > > > > > > > > >
> > > > > > > > > > Any comment, idea, pointer would be much appreciated!
> Thank
> > > you
> > > > > ^^
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > -V.
> > > > > > > > > >
> > > > > > > > > > [1]:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > > > > > > > > > [2]:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Vasiliki Kalavri <va...@gmail.com>.

I see that's cool :-)
So, what is the advantage of closed-loop versus for-loop iterations?
Custom convergence criteria / aggregators and more efficient execution
plans?

On 23 February 2015 at 15:01, Stephan Ewen <se...@apache.org> wrote:

> For loops are basically rolled out - they yield long execution plans.
>
> On Mon, Feb 23, 2015 at 2:44 PM, Vasiliki Kalavri <
> vasilikikalavri@gmail.com
> > wrote:
>
> > for-loop iterations could cover some cases, I guess, when the number of
> > iterations is known beforehand.
> > Are there currently any restrictions on what can be used inside a
> for-loop?
> > How are they translated into execution plans?
> >
> > On 23 February 2015 at 13:08, Stephan Ewen <se...@apache.org> wrote:
> >
> > > Some things may not work well as "closed-loop" iterations.
> > >
> > > Is it possible to express those as for-loop iterations?
> > >
> > > On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <
> > > vasilikikalavri@gmail.com
> > > > wrote:
> > >
> > > > Hi Stephan,
> > > >
> > > > yes, this would work for the cases where an algorithm only updates
> the
> > > > vertex values or only updates the edge values.
> > > >
> > > > What we would like to also support is
> > > > (a) algorithms where both vertices and edges are updated in one
> > iteration
> > > > (b) algorithms where the graph structure changes from one iteration
> to
> > > the
> > > > next and
> > > > (c) branching inside an iteration, i.e. executing a different
> > "iteration
> > > > body" based on some condition.
> > > >
> > > > We can still currently implement those with regular Flink iteration
> > > > operators, but the resulting code is not that nice or efficient.
> > > > For example, if we want to update both edges and vertex values, we
> can
> > > > still create a solution set where the vertex values are attached to
> > each
> > > > edge.
> > > > Regarding branching inside an iteration, we can use an aggregator
> that
> > > > tracks the iteration phase, but then we need to somehow make the
> > > different
> > > > phases to consist of the same operators and also check the branching
> > > > condition inside each UDF.
> > > >
> > > > Cheers,
> > > > V.
> > > >
> > > >
> > > > On 23 February 2015 at 11:05, Stephan Ewen <se...@apache.org> wrote:
> > > >
> > > > > As a workaround, it should always work to get the Edge and Vertex
> > data
> > > > set
> > > > > from the graph and use the regular Fink iteration operators?
> > > > >
> > > > >
> > > > > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
> > > > > vasilikikalavri@gmail.com
> > > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > yes, I was referring to the parallel Boruvka algorithm. There are
> > > > several
> > > > > > ways to implement this one in Flink and I believe that the one
> > > > described
> > > > > in
> > > > > > the paper (vertex-centric) is not the most elegant one :)
> > > > > >
> > > > > > Andra is now working on an idea that uses the delta iteration
> > > > abstraction
> > > > > > and we believe that it will be both more efficient and easier to
> > > > > > understand. It has the edges in the solution set and the vertices
> > in
> > > > the
> > > > > > workset, so it follows the pattern I describe in (2) in my
> previous
> > > > > e-mail.
> > > > > > As a next step, we would like to see how having an iteration
> > operator
> > > > > that
> > > > > > could update the whole graph -what I describe as (3)- would make
> > this
> > > > > even
> > > > > > nicer.
> > > > > >
> > > > > > Any ideas are highly welcome!
> > > > > >
> > > > > > Cheers,
> > > > > > V.
> > > > > >
> > > > > > On 22 February 2015 at 16:32, Andra Lungu <lungu.andra@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Alex,
> > > > > > >
> > > > > > > Vasia is talking about the second version(presented Friday) of
> > > > Parallel
> > > > > > > Boruvka, which can be found here:
> > > > > > > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> > > > > > >
> > > > > > > I will propose the third, non-Pregel like approach directly to
> > > Gelly
> > > > > > soon.
> > > > > > >
> > > > > > > If you have additional questions, I will be happy to answer
> them.
> > > > > > >
> > > > > > > Andra
> > > > > > >
> > > > > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
> > > > > > > alexander.s.alexandrov@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Vasia,
> > > > > > > >
> > > > > > > > I am trying to look at the problem in more detail. Which
> > version
> > > of
> > > > > the
> > > > > > > MST
> > > > > > > > are you talking about?
> > > > > > > >
> > > > > > > > Right now in the Gelly repository I can only find the SSSP
> > > example
> > > > > > > > (parallel Bellman-Ford) from Section 4.2 in [1].
> > > > > > > >
> > > > > > > > However, it seems that the issues encountered by Andra are
> > > related
> > > > to
> > > > > > the
> > > > > > > > implementation of Parallel Boruvka (Section 3.2 in [2]). Is
> > that
> > > > > > correct?
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > A.
> > > > > > > >
> > > > > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> > > > > > > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> > > > > > > >
> > > > > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
> > > > > vasilikikalavri@gmail.com
> > > > > > >:
> > > > > > > >
> > > > > > > > > Hello beautiful Flink people,
> > > > > > > > >
> > > > > > > > > during the past few days, Andra and I have been discussing
> > > about
> > > > > how
> > > > > > to
> > > > > > > > > extend Gelly's iteration methods.
> > > > > > > > >
> > > > > > > > > Alexander's course (and his awesome students) has made it
> > > obvious
> > > > > > that
> > > > > > > > > vertex-centric iterations are not the best fit for
> algorithms
> > > > which
> > > > > > > don't
> > > > > > > > > follow the common "propagate-update" pattern. For example,
> > > Andra
> > > > is
> > > > > > > > working
> > > > > > > > > on an implementation of Minimum Spanning Tree, which
> requires
> > > > > > branching
> > > > > > > > > inside an iteration and also requires a convergence check
> of
> > an
> > > > > > > internal
> > > > > > > > > iteration. Others also reported similar issues [1, 2].
> Trying
> > > to
> > > > > fit
> > > > > > > such
> > > > > > > > > algorithms to the vertex-centric model leads to long and
> ugly
> > > > code,
> > > > > > > e.g.
> > > > > > > > > aggregators to keep track of algorithm phases, duplicating
> > > data,
> > > > > etc.
> > > > > > > > >
> > > > > > > > > One limitation of the vertex-centric and the upcoming GAS
> > model
> > > > is
> > > > > > that
> > > > > > > > > they both only allow the vertex values to be updated in
> each
> > > > > > iteration.
> > > > > > > > > However, for some algorithms we need to update the edge
> > values
> > > > and
> > > > > in
> > > > > > > > > others we need to update both. In even more complex
> > situations
> > > > > (like
> > > > > > > > > Andra's MST) in some iterations we need to update the
> vertex
> > > > values
> > > > > > and
> > > > > > > > in
> > > > > > > > > some iterations we need to update the edge values.
> > > > > > > > > Another problem is that we currently don't have a way to
> > allow
> > > > > > > different
> > > > > > > > > computational phases inside an iteration. This is something
> > > that
> > > > > > Giraph
> > > > > > > > > solves with master compute, a function that is executed
> once
> > > > before
> > > > > > > each
> > > > > > > > > superstep and sets the computation function.
> > > > > > > > >
> > > > > > > > > All that said, I believe that we can solve most of these
> > issues
> > > > if
> > > > > we
> > > > > > > > > nicely expose Flink's iteration operators in Gelly. I can
> see
> > > the
> > > > > > > > following
> > > > > > > > > cases:
> > > > > > > > >
> > > > > > > > > 1. Bulk & delta iterations where the solution set is the
> > vertex
> > > > > > > dataset:
> > > > > > > > > this will be similar to vertex-centric and GAS, but will
> > allow
> > > > more
> > > > > > > > > flexible dataflows inside the iteration.
> > > > > > > > > 2. Bulk & delta iterations where the solution set is the
> edge
> > > > > > dataset:
> > > > > > > > for
> > > > > > > > > the cases where we need to update edge values.
> > > > > > > > > 3. Bulk & delta iterations where the solution set is the
> > Graph:
> > > > > this
> > > > > > > will
> > > > > > > > > cover more complex cases, where the algorithm updates both
> > > > vertices
> > > > > > and
> > > > > > > > > edges or even adds/removes vertices/edges, i.e. updates the
> > > whole
> > > > > > > Graph.
> > > > > > > > >
> > > > > > > > > What do you think? I can see 1 & 2 being very easy to
> > > implement,
> > > > > but
> > > > > > I
> > > > > > > > > suspect 3 won't be that easy (but so awesome to have ^^).
> > > > > > > > > Would it work the way a Graph is represented now, i.e.
> with 2
> > > > > > DataSets?
> > > > > > > > >
> > > > > > > > > Any comment, idea, pointer would be much appreciated! Thank
> > you
> > > > ^^
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > -V.
> > > > > > > > >
> > > > > > > > > [1]:
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > > > > > > > > [2]:
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Stephan Ewen <se...@apache.org>.

For loops are basically rolled out - they yield long execution plans.

On Mon, Feb 23, 2015 at 2:44 PM, Vasiliki Kalavri <vasilikikalavri@gmail.com
> wrote:

> for-loop iterations could cover some cases, I guess, when the number of
> iterations is known beforehand.
> Are there currently any restrictions on what can be used inside a for-loop?
> How are they translated into execution plans?
>
> On 23 February 2015 at 13:08, Stephan Ewen <se...@apache.org> wrote:
>
> > Some things may not work well as "closed-loop" iterations.
> >
> > Is it possible to express those as for-loop iterations?
> >
> > On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <
> > vasilikikalavri@gmail.com
> > > wrote:
> >
> > > Hi Stephan,
> > >
> > > yes, this would work for the cases where an algorithm only updates the
> > > vertex values or only updates the edge values.
> > >
> > > What we would like to also support is
> > > (a) algorithms where both vertices and edges are updated in one
> iteration
> > > (b) algorithms where the graph structure changes from one iteration to
> > the
> > > next and
> > > (c) branching inside an iteration, i.e. executing a different
> "iteration
> > > body" based on some condition.
> > >
> > > We can still currently implement those with regular Flink iteration
> > > operators, but the resulting code is not that nice or efficient.
> > > For example, if we want to update both edges and vertex values, we can
> > > still create a solution set where the vertex values are attached to
> each
> > > edge.
> > > Regarding branching inside an iteration, we can use an aggregator that
> > > tracks the iteration phase, but then we need to somehow make the
> > different
> > > phases to consist of the same operators and also check the branching
> > > condition inside each UDF.
> > >
> > > Cheers,
> > > V.
> > >
> > >
> > > On 23 February 2015 at 11:05, Stephan Ewen <se...@apache.org> wrote:
> > >
> > > > As a workaround, it should always work to get the Edge and Vertex
> data
> > > set
> > > > from the graph and use the regular Fink iteration operators?
> > > >
> > > >
> > > > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
> > > > vasilikikalavri@gmail.com
> > > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > yes, I was referring to the parallel Boruvka algorithm. There are
> > > several
> > > > > ways to implement this one in Flink and I believe that the one
> > > described
> > > > in
> > > > > the paper (vertex-centric) is not the most elegant one :)
> > > > >
> > > > > Andra is now working on an idea that uses the delta iteration
> > > abstraction
> > > > > and we believe that it will be both more efficient and easier to
> > > > > understand. It has the edges in the solution set and the vertices
> in
> > > the
> > > > > workset, so it follows the pattern I describe in (2) in my previous
> > > > e-mail.
> > > > > As a next step, we would like to see how having an iteration
> operator
> > > > that
> > > > > could update the whole graph -what I describe as (3)- would make
> this
> > > > even
> > > > > nicer.
> > > > >
> > > > > Any ideas are highly welcome!
> > > > >
> > > > > Cheers,
> > > > > V.
> > > > >
> > > > > On 22 February 2015 at 16:32, Andra Lungu <lu...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Alex,
> > > > > >
> > > > > > Vasia is talking about the second version(presented Friday) of
> > > Parallel
> > > > > > Boruvka, which can be found here:
> > > > > > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> > > > > >
> > > > > > I will propose the third, non-Pregel like approach directly to
> > Gelly
> > > > > soon.
> > > > > >
> > > > > > If you have additional questions, I will be happy to answer them.
> > > > > >
> > > > > > Andra
> > > > > >
> > > > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
> > > > > > alexander.s.alexandrov@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Vasia,
> > > > > > >
> > > > > > > I am trying to look at the problem in more detail. Which
> version
> > of
> > > > the
> > > > > > MST
> > > > > > > are you talking about?
> > > > > > >
> > > > > > > Right now in the Gelly repository I can only find the SSSP
> > example
> > > > > > > (parallel Bellman-Ford) from Section 4.2 in [1].
> > > > > > >
> > > > > > > However, it seems that the issues encountered by Andra are
> > related
> > > to
> > > > > the
> > > > > > > implementation of Parallel Boruvka (Section 3.2 in [2]). Is
> that
> > > > > correct?
> > > > > > >
> > > > > > > Regards,
> > > > > > > A.
> > > > > > >
> > > > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> > > > > > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> > > > > > >
> > > > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
> > > > vasilikikalavri@gmail.com
> > > > > >:
> > > > > > >
> > > > > > > > Hello beautiful Flink people,
> > > > > > > >
> > > > > > > > during the past few days, Andra and I have been discussing
> > about
> > > > how
> > > > > to
> > > > > > > > extend Gelly's iteration methods.
> > > > > > > >
> > > > > > > > Alexander's course (and his awesome students) has made it
> > obvious
> > > > > that
> > > > > > > > vertex-centric iterations are not the best fit for algorithms
> > > which
> > > > > > don't
> > > > > > > > follow the common "propagate-update" pattern. For example,
> > Andra
> > > is
> > > > > > > working
> > > > > > > > on an implementation of Minimum Spanning Tree, which requires
> > > > > branching
> > > > > > > > inside an iteration and also requires a convergence check of
> an
> > > > > > internal
> > > > > > > > iteration. Others also reported similar issues [1, 2]. Trying
> > to
> > > > fit
> > > > > > such
> > > > > > > > algorithms to the vertex-centric model leads to long and ugly
> > > code,
> > > > > > e.g.
> > > > > > > > aggregators to keep track of algorithm phases, duplicating
> > data,
> > > > etc.
> > > > > > > >
> > > > > > > > One limitation of the vertex-centric and the upcoming GAS
> model
> > > is
> > > > > that
> > > > > > > > they both only allow the vertex values to be updated in each
> > > > > iteration.
> > > > > > > > However, for some algorithms we need to update the edge
> values
> > > and
> > > > in
> > > > > > > > others we need to update both. In even more complex
> situations
> > > > (like
> > > > > > > > Andra's MST) in some iterations we need to update the vertex
> > > values
> > > > > and
> > > > > > > in
> > > > > > > > some iterations we need to update the edge values.
> > > > > > > > Another problem is that we currently don't have a way to
> allow
> > > > > > different
> > > > > > > > computational phases inside an iteration. This is something
> > that
> > > > > Giraph
> > > > > > > > solves with master compute, a function that is executed once
> > > before
> > > > > > each
> > > > > > > > superstep and sets the computation function.
> > > > > > > >
> > > > > > > > All that said, I believe that we can solve most of these
> issues
> > > if
> > > > we
> > > > > > > > nicely expose Flink's iteration operators in Gelly. I can see
> > the
> > > > > > > following
> > > > > > > > cases:
> > > > > > > >
> > > > > > > > 1. Bulk & delta iterations where the solution set is the
> vertex
> > > > > > dataset:
> > > > > > > > this will be similar to vertex-centric and GAS, but will
> allow
> > > more
> > > > > > > > flexible dataflows inside the iteration.
> > > > > > > > 2. Bulk & delta iterations where the solution set is the edge
> > > > > dataset:
> > > > > > > for
> > > > > > > > the cases where we need to update edge values.
> > > > > > > > 3. Bulk & delta iterations where the solution set is the
> Graph:
> > > > this
> > > > > > will
> > > > > > > > cover more complex cases, where the algorithm updates both
> > > vertices
> > > > > and
> > > > > > > > edges or even adds/removes vertices/edges, i.e. updates the
> > whole
> > > > > > Graph.
> > > > > > > >
> > > > > > > > What do you think? I can see 1 & 2 being very easy to
> > implement,
> > > > but
> > > > > I
> > > > > > > > suspect 3 won't be that easy (but so awesome to have ^^).
> > > > > > > > Would it work the way a Graph is represented now, i.e. with 2
> > > > > DataSets?
> > > > > > > >
> > > > > > > > Any comment, idea, pointer would be much appreciated! Thank
> you
> > > ^^
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > -V.
> > > > > > > >
> > > > > > > > [1]:
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > > > > > > > [2]:
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Vasiliki Kalavri <va...@gmail.com>.

for-loop iterations could cover some cases, I guess, when the number of
iterations is known beforehand.
Are there currently any restrictions on what can be used inside a for-loop?
How are they translated into execution plans?

On 23 February 2015 at 13:08, Stephan Ewen <se...@apache.org> wrote:

> Some things may not work well as "closed-loop" iterations.
>
> Is it possible to express those as for-loop iterations?
>
> On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <
> vasilikikalavri@gmail.com
> > wrote:
>
> > Hi Stephan,
> >
> > yes, this would work for the cases where an algorithm only updates the
> > vertex values or only updates the edge values.
> >
> > What we would like to also support is
> > (a) algorithms where both vertices and edges are updated in one iteration
> > (b) algorithms where the graph structure changes from one iteration to
> the
> > next and
> > (c) branching inside an iteration, i.e. executing a different "iteration
> > body" based on some condition.
> >
> > We can still currently implement those with regular Flink iteration
> > operators, but the resulting code is not that nice or efficient.
> > For example, if we want to update both edges and vertex values, we can
> > still create a solution set where the vertex values are attached to each
> > edge.
> > Regarding branching inside an iteration, we can use an aggregator that
> > tracks the iteration phase, but then we need to somehow make the
> different
> > phases to consist of the same operators and also check the branching
> > condition inside each UDF.
> >
> > Cheers,
> > V.
> >
> >
> > On 23 February 2015 at 11:05, Stephan Ewen <se...@apache.org> wrote:
> >
> > > As a workaround, it should always work to get the Edge and Vertex data
> > set
> > > from the graph and use the regular Fink iteration operators?
> > >
> > >
> > > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
> > > vasilikikalavri@gmail.com
> > > > wrote:
> > >
> > > > Hi,
> > > >
> > > > yes, I was referring to the parallel Boruvka algorithm. There are
> > several
> > > > ways to implement this one in Flink and I believe that the one
> > described
> > > in
> > > > the paper (vertex-centric) is not the most elegant one :)
> > > >
> > > > Andra is now working on an idea that uses the delta iteration
> > abstraction
> > > > and we believe that it will be both more efficient and easier to
> > > > understand. It has the edges in the solution set and the vertices in
> > the
> > > > workset, so it follows the pattern I describe in (2) in my previous
> > > e-mail.
> > > > As a next step, we would like to see how having an iteration operator
> > > that
> > > > could update the whole graph -what I describe as (3)- would make this
> > > even
> > > > nicer.
> > > >
> > > > Any ideas are highly welcome!
> > > >
> > > > Cheers,
> > > > V.
> > > >
> > > > On 22 February 2015 at 16:32, Andra Lungu <lu...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Alex,
> > > > >
> > > > > Vasia is talking about the second version(presented Friday) of
> > Parallel
> > > > > Boruvka, which can be found here:
> > > > > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> > > > >
> > > > > I will propose the third, non-Pregel like approach directly to
> Gelly
> > > > soon.
> > > > >
> > > > > If you have additional questions, I will be happy to answer them.
> > > > >
> > > > > Andra
> > > > >
> > > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
> > > > > alexander.s.alexandrov@gmail.com> wrote:
> > > > >
> > > > > > Hi Vasia,
> > > > > >
> > > > > > I am trying to look at the problem in more detail. Which version
> of
> > > the
> > > > > MST
> > > > > > are you talking about?
> > > > > >
> > > > > > Right now in the Gelly repository I can only find the SSSP
> example
> > > > > > (parallel Bellman-Ford) from Section 4.2 in [1].
> > > > > >
> > > > > > However, it seems that the issues encountered by Andra are
> related
> > to
> > > > the
> > > > > > implementation of Parallel Boruvka (Section 3.2 in [2]). Is that
> > > > correct?
> > > > > >
> > > > > > Regards,
> > > > > > A.
> > > > > >
> > > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> > > > > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> > > > > >
> > > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
> > > vasilikikalavri@gmail.com
> > > > >:
> > > > > >
> > > > > > > Hello beautiful Flink people,
> > > > > > >
> > > > > > > during the past few days, Andra and I have been discussing
> about
> > > how
> > > > to
> > > > > > > extend Gelly's iteration methods.
> > > > > > >
> > > > > > > Alexander's course (and his awesome students) has made it
> obvious
> > > > that
> > > > > > > vertex-centric iterations are not the best fit for algorithms
> > which
> > > > > don't
> > > > > > > follow the common "propagate-update" pattern. For example,
> Andra
> > is
> > > > > > working
> > > > > > > on an implementation of Minimum Spanning Tree, which requires
> > > > branching
> > > > > > > inside an iteration and also requires a convergence check of an
> > > > > internal
> > > > > > > iteration. Others also reported similar issues [1, 2]. Trying
> to
> > > fit
> > > > > such
> > > > > > > algorithms to the vertex-centric model leads to long and ugly
> > code,
> > > > > e.g.
> > > > > > > aggregators to keep track of algorithm phases, duplicating
> data,
> > > etc.
> > > > > > >
> > > > > > > One limitation of the vertex-centric and the upcoming GAS model
> > is
> > > > that
> > > > > > > they both only allow the vertex values to be updated in each
> > > > iteration.
> > > > > > > However, for some algorithms we need to update the edge values
> > and
> > > in
> > > > > > > others we need to update both. In even more complex situations
> > > (like
> > > > > > > Andra's MST) in some iterations we need to update the vertex
> > values
> > > > and
> > > > > > in
> > > > > > > some iterations we need to update the edge values.
> > > > > > > Another problem is that we currently don't have a way to allow
> > > > > different
> > > > > > > computational phases inside an iteration. This is something
> that
> > > > Giraph
> > > > > > > solves with master compute, a function that is executed once
> > before
> > > > > each
> > > > > > > superstep and sets the computation function.
> > > > > > >
> > > > > > > All that said, I believe that we can solve most of these issues
> > if
> > > we
> > > > > > > nicely expose Flink's iteration operators in Gelly. I can see
> the
> > > > > > following
> > > > > > > cases:
> > > > > > >
> > > > > > > 1. Bulk & delta iterations where the solution set is the vertex
> > > > > dataset:
> > > > > > > this will be similar to vertex-centric and GAS, but will allow
> > more
> > > > > > > flexible dataflows inside the iteration.
> > > > > > > 2. Bulk & delta iterations where the solution set is the edge
> > > > dataset:
> > > > > > for
> > > > > > > the cases where we need to update edge values.
> > > > > > > 3. Bulk & delta iterations where the solution set is the Graph:
> > > this
> > > > > will
> > > > > > > cover more complex cases, where the algorithm updates both
> > vertices
> > > > and
> > > > > > > edges or even adds/removes vertices/edges, i.e. updates the
> whole
> > > > > Graph.
> > > > > > >
> > > > > > > What do you think? I can see 1 & 2 being very easy to
> implement,
> > > but
> > > > I
> > > > > > > suspect 3 won't be that easy (but so awesome to have ^^).
> > > > > > > Would it work the way a Graph is represented now, i.e. with 2
> > > > DataSets?
> > > > > > >
> > > > > > > Any comment, idea, pointer would be much appreciated! Thank you
> > ^^
> > > > > > >
> > > > > > > Cheers,
> > > > > > > -V.
> > > > > > >
> > > > > > > [1]:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > > > > > > [2]:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Stephan Ewen <se...@apache.org>.

Some things may not work well as "closed-loop" iterations.

Is it possible to express those as for-loop iterations?

On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <vasilikikalavri@gmail.com
> wrote:

> Hi Stephan,
>
> yes, this would work for the cases where an algorithm only updates the
> vertex values or only updates the edge values.
>
> What we would like to also support is
> (a) algorithms where both vertices and edges are updated in one iteration
> (b) algorithms where the graph structure changes from one iteration to the
> next and
> (c) branching inside an iteration, i.e. executing a different "iteration
> body" based on some condition.
>
> We can still currently implement those with regular Flink iteration
> operators, but the resulting code is not that nice or efficient.
> For example, if we want to update both edges and vertex values, we can
> still create a solution set where the vertex values are attached to each
> edge.
> Regarding branching inside an iteration, we can use an aggregator that
> tracks the iteration phase, but then we need to somehow make the different
> phases to consist of the same operators and also check the branching
> condition inside each UDF.
>
> Cheers,
> V.
>
>
> On 23 February 2015 at 11:05, Stephan Ewen <se...@apache.org> wrote:
>
> > As a workaround, it should always work to get the Edge and Vertex data
> set
> > from the graph and use the regular Fink iteration operators?
> >
> >
> > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
> > vasilikikalavri@gmail.com
> > > wrote:
> >
> > > Hi,
> > >
> > > yes, I was referring to the parallel Boruvka algorithm. There are
> several
> > > ways to implement this one in Flink and I believe that the one
> described
> > in
> > > the paper (vertex-centric) is not the most elegant one :)
> > >
> > > Andra is now working on an idea that uses the delta iteration
> abstraction
> > > and we believe that it will be both more efficient and easier to
> > > understand. It has the edges in the solution set and the vertices in
> the
> > > workset, so it follows the pattern I describe in (2) in my previous
> > e-mail.
> > > As a next step, we would like to see how having an iteration operator
> > that
> > > could update the whole graph -what I describe as (3)- would make this
> > even
> > > nicer.
> > >
> > > Any ideas are highly welcome!
> > >
> > > Cheers,
> > > V.
> > >
> > > On 22 February 2015 at 16:32, Andra Lungu <lu...@gmail.com>
> wrote:
> > >
> > > > Hi Alex,
> > > >
> > > > Vasia is talking about the second version(presented Friday) of
> Parallel
> > > > Boruvka, which can be found here:
> > > > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> > > >
> > > > I will propose the third, non-Pregel like approach directly to Gelly
> > > soon.
> > > >
> > > > If you have additional questions, I will be happy to answer them.
> > > >
> > > > Andra
> > > >
> > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
> > > > alexander.s.alexandrov@gmail.com> wrote:
> > > >
> > > > > Hi Vasia,
> > > > >
> > > > > I am trying to look at the problem in more detail. Which version of
> > the
> > > > MST
> > > > > are you talking about?
> > > > >
> > > > > Right now in the Gelly repository I can only find the SSSP example
> > > > > (parallel Bellman-Ford) from Section 4.2 in [1].
> > > > >
> > > > > However, it seems that the issues encountered by Andra are related
> to
> > > the
> > > > > implementation of Parallel Boruvka (Section 3.2 in [2]). Is that
> > > correct?
> > > > >
> > > > > Regards,
> > > > > A.
> > > > >
> > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> > > > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> > > > >
> > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
> > vasilikikalavri@gmail.com
> > > >:
> > > > >
> > > > > > Hello beautiful Flink people,
> > > > > >
> > > > > > during the past few days, Andra and I have been discussing about
> > how
> > > to
> > > > > > extend Gelly's iteration methods.
> > > > > >
> > > > > > Alexander's course (and his awesome students) has made it obvious
> > > that
> > > > > > vertex-centric iterations are not the best fit for algorithms
> which
> > > > don't
> > > > > > follow the common "propagate-update" pattern. For example, Andra
> is
> > > > > working
> > > > > > on an implementation of Minimum Spanning Tree, which requires
> > > branching
> > > > > > inside an iteration and also requires a convergence check of an
> > > > internal
> > > > > > iteration. Others also reported similar issues [1, 2]. Trying to
> > fit
> > > > such
> > > > > > algorithms to the vertex-centric model leads to long and ugly
> code,
> > > > e.g.
> > > > > > aggregators to keep track of algorithm phases, duplicating data,
> > etc.
> > > > > >
> > > > > > One limitation of the vertex-centric and the upcoming GAS model
> is
> > > that
> > > > > > they both only allow the vertex values to be updated in each
> > > iteration.
> > > > > > However, for some algorithms we need to update the edge values
> and
> > in
> > > > > > others we need to update both. In even more complex situations
> > (like
> > > > > > Andra's MST) in some iterations we need to update the vertex
> values
> > > and
> > > > > in
> > > > > > some iterations we need to update the edge values.
> > > > > > Another problem is that we currently don't have a way to allow
> > > > different
> > > > > > computational phases inside an iteration. This is something that
> > > Giraph
> > > > > > solves with master compute, a function that is executed once
> before
> > > > each
> > > > > > superstep and sets the computation function.
> > > > > >
> > > > > > All that said, I believe that we can solve most of these issues
> if
> > we
> > > > > > nicely expose Flink's iteration operators in Gelly. I can see the
> > > > > following
> > > > > > cases:
> > > > > >
> > > > > > 1. Bulk & delta iterations where the solution set is the vertex
> > > > dataset:
> > > > > > this will be similar to vertex-centric and GAS, but will allow
> more
> > > > > > flexible dataflows inside the iteration.
> > > > > > 2. Bulk & delta iterations where the solution set is the edge
> > > dataset:
> > > > > for
> > > > > > the cases where we need to update edge values.
> > > > > > 3. Bulk & delta iterations where the solution set is the Graph:
> > this
> > > > will
> > > > > > cover more complex cases, where the algorithm updates both
> vertices
> > > and
> > > > > > edges or even adds/removes vertices/edges, i.e. updates the whole
> > > > Graph.
> > > > > >
> > > > > > What do you think? I can see 1 & 2 being very easy to implement,
> > but
> > > I
> > > > > > suspect 3 won't be that easy (but so awesome to have ^^).
> > > > > > Would it work the way a Graph is represented now, i.e. with 2
> > > DataSets?
> > > > > >
> > > > > > Any comment, idea, pointer would be much appreciated! Thank you
> ^^
> > > > > >
> > > > > > Cheers,
> > > > > > -V.
> > > > > >
> > > > > > [1]:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > > > > > [2]:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Vasiliki Kalavri <va...@gmail.com>.

Hi Stephan,

yes, this would work for the cases where an algorithm only updates the
vertex values or only updates the edge values.

What we would like to also support is
(a) algorithms where both vertices and edges are updated in one iteration
(b) algorithms where the graph structure changes from one iteration to the
next and
(c) branching inside an iteration, i.e. executing a different "iteration
body" based on some condition.

We can still currently implement those with regular Flink iteration
operators, but the resulting code is not that nice or efficient.
For example, if we want to update both edges and vertex values, we can
still create a solution set where the vertex values are attached to each
edge.
Regarding branching inside an iteration, we can use an aggregator that
tracks the iteration phase, but then we need to somehow make the different
phases to consist of the same operators and also check the branching
condition inside each UDF.

Cheers,
V.


On 23 February 2015 at 11:05, Stephan Ewen <se...@apache.org> wrote:

> As a workaround, it should always work to get the Edge and Vertex data set
> from the graph and use the regular Fink iteration operators?
>
>
> On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
> vasilikikalavri@gmail.com
> > wrote:
>
> > Hi,
> >
> > yes, I was referring to the parallel Boruvka algorithm. There are several
> > ways to implement this one in Flink and I believe that the one described
> in
> > the paper (vertex-centric) is not the most elegant one :)
> >
> > Andra is now working on an idea that uses the delta iteration abstraction
> > and we believe that it will be both more efficient and easier to
> > understand. It has the edges in the solution set and the vertices in the
> > workset, so it follows the pattern I describe in (2) in my previous
> e-mail.
> > As a next step, we would like to see how having an iteration operator
> that
> > could update the whole graph -what I describe as (3)- would make this
> even
> > nicer.
> >
> > Any ideas are highly welcome!
> >
> > Cheers,
> > V.
> >
> > On 22 February 2015 at 16:32, Andra Lungu <lu...@gmail.com> wrote:
> >
> > > Hi Alex,
> > >
> > > Vasia is talking about the second version(presented Friday) of Parallel
> > > Boruvka, which can be found here:
> > > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> > >
> > > I will propose the third, non-Pregel like approach directly to Gelly
> > soon.
> > >
> > > If you have additional questions, I will be happy to answer them.
> > >
> > > Andra
> > >
> > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
> > > alexander.s.alexandrov@gmail.com> wrote:
> > >
> > > > Hi Vasia,
> > > >
> > > > I am trying to look at the problem in more detail. Which version of
> the
> > > MST
> > > > are you talking about?
> > > >
> > > > Right now in the Gelly repository I can only find the SSSP example
> > > > (parallel Bellman-Ford) from Section 4.2 in [1].
> > > >
> > > > However, it seems that the issues encountered by Andra are related to
> > the
> > > > implementation of Parallel Boruvka (Section 3.2 in [2]). Is that
> > correct?
> > > >
> > > > Regards,
> > > > A.
> > > >
> > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> > > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> > > >
> > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
> vasilikikalavri@gmail.com
> > >:
> > > >
> > > > > Hello beautiful Flink people,
> > > > >
> > > > > during the past few days, Andra and I have been discussing about
> how
> > to
> > > > > extend Gelly's iteration methods.
> > > > >
> > > > > Alexander's course (and his awesome students) has made it obvious
> > that
> > > > > vertex-centric iterations are not the best fit for algorithms which
> > > don't
> > > > > follow the common "propagate-update" pattern. For example, Andra is
> > > > working
> > > > > on an implementation of Minimum Spanning Tree, which requires
> > branching
> > > > > inside an iteration and also requires a convergence check of an
> > > internal
> > > > > iteration. Others also reported similar issues [1, 2]. Trying to
> fit
> > > such
> > > > > algorithms to the vertex-centric model leads to long and ugly code,
> > > e.g.
> > > > > aggregators to keep track of algorithm phases, duplicating data,
> etc.
> > > > >
> > > > > One limitation of the vertex-centric and the upcoming GAS model is
> > that
> > > > > they both only allow the vertex values to be updated in each
> > iteration.
> > > > > However, for some algorithms we need to update the edge values and
> in
> > > > > others we need to update both. In even more complex situations
> (like
> > > > > Andra's MST) in some iterations we need to update the vertex values
> > and
> > > > in
> > > > > some iterations we need to update the edge values.
> > > > > Another problem is that we currently don't have a way to allow
> > > different
> > > > > computational phases inside an iteration. This is something that
> > Giraph
> > > > > solves with master compute, a function that is executed once before
> > > each
> > > > > superstep and sets the computation function.
> > > > >
> > > > > All that said, I believe that we can solve most of these issues if
> we
> > > > > nicely expose Flink's iteration operators in Gelly. I can see the
> > > > following
> > > > > cases:
> > > > >
> > > > > 1. Bulk & delta iterations where the solution set is the vertex
> > > dataset:
> > > > > this will be similar to vertex-centric and GAS, but will allow more
> > > > > flexible dataflows inside the iteration.
> > > > > 2. Bulk & delta iterations where the solution set is the edge
> > dataset:
> > > > for
> > > > > the cases where we need to update edge values.
> > > > > 3. Bulk & delta iterations where the solution set is the Graph:
> this
> > > will
> > > > > cover more complex cases, where the algorithm updates both vertices
> > and
> > > > > edges or even adds/removes vertices/edges, i.e. updates the whole
> > > Graph.
> > > > >
> > > > > What do you think? I can see 1 & 2 being very easy to implement,
> but
> > I
> > > > > suspect 3 won't be that easy (but so awesome to have ^^).
> > > > > Would it work the way a Graph is represented now, i.e. with 2
> > DataSets?
> > > > >
> > > > > Any comment, idea, pointer would be much appreciated! Thank you ^^
> > > > >
> > > > > Cheers,
> > > > > -V.
> > > > >
> > > > > [1]:
> > > > >
> > > > >
> > > >
> > >
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > > > > [2]:
> > > > >
> > > > >
> > > >
> > >
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Stephan Ewen <se...@apache.org>.

As a workaround, it should always work to get the Edge and Vertex data set
from the graph and use the regular Fink iteration operators?


On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <vasilikikalavri@gmail.com
> wrote:

> Hi,
>
> yes, I was referring to the parallel Boruvka algorithm. There are several
> ways to implement this one in Flink and I believe that the one described in
> the paper (vertex-centric) is not the most elegant one :)
>
> Andra is now working on an idea that uses the delta iteration abstraction
> and we believe that it will be both more efficient and easier to
> understand. It has the edges in the solution set and the vertices in the
> workset, so it follows the pattern I describe in (2) in my previous e-mail.
> As a next step, we would like to see how having an iteration operator that
> could update the whole graph -what I describe as (3)- would make this even
> nicer.
>
> Any ideas are highly welcome!
>
> Cheers,
> V.
>
> On 22 February 2015 at 16:32, Andra Lungu <lu...@gmail.com> wrote:
>
> > Hi Alex,
> >
> > Vasia is talking about the second version(presented Friday) of Parallel
> > Boruvka, which can be found here:
> > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> >
> > I will propose the third, non-Pregel like approach directly to Gelly
> soon.
> >
> > If you have additional questions, I will be happy to answer them.
> >
> > Andra
> >
> > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
> > alexander.s.alexandrov@gmail.com> wrote:
> >
> > > Hi Vasia,
> > >
> > > I am trying to look at the problem in more detail. Which version of the
> > MST
> > > are you talking about?
> > >
> > > Right now in the Gelly repository I can only find the SSSP example
> > > (parallel Bellman-Ford) from Section 4.2 in [1].
> > >
> > > However, it seems that the issues encountered by Andra are related to
> the
> > > implementation of Parallel Boruvka (Section 3.2 in [2]). Is that
> correct?
> > >
> > > Regards,
> > > A.
> > >
> > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> > >
> > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <vasilikikalavri@gmail.com
> >:
> > >
> > > > Hello beautiful Flink people,
> > > >
> > > > during the past few days, Andra and I have been discussing about how
> to
> > > > extend Gelly's iteration methods.
> > > >
> > > > Alexander's course (and his awesome students) has made it obvious
> that
> > > > vertex-centric iterations are not the best fit for algorithms which
> > don't
> > > > follow the common "propagate-update" pattern. For example, Andra is
> > > working
> > > > on an implementation of Minimum Spanning Tree, which requires
> branching
> > > > inside an iteration and also requires a convergence check of an
> > internal
> > > > iteration. Others also reported similar issues [1, 2]. Trying to fit
> > such
> > > > algorithms to the vertex-centric model leads to long and ugly code,
> > e.g.
> > > > aggregators to keep track of algorithm phases, duplicating data, etc.
> > > >
> > > > One limitation of the vertex-centric and the upcoming GAS model is
> that
> > > > they both only allow the vertex values to be updated in each
> iteration.
> > > > However, for some algorithms we need to update the edge values and in
> > > > others we need to update both. In even more complex situations (like
> > > > Andra's MST) in some iterations we need to update the vertex values
> and
> > > in
> > > > some iterations we need to update the edge values.
> > > > Another problem is that we currently don't have a way to allow
> > different
> > > > computational phases inside an iteration. This is something that
> Giraph
> > > > solves with master compute, a function that is executed once before
> > each
> > > > superstep and sets the computation function.
> > > >
> > > > All that said, I believe that we can solve most of these issues if we
> > > > nicely expose Flink's iteration operators in Gelly. I can see the
> > > following
> > > > cases:
> > > >
> > > > 1. Bulk & delta iterations where the solution set is the vertex
> > dataset:
> > > > this will be similar to vertex-centric and GAS, but will allow more
> > > > flexible dataflows inside the iteration.
> > > > 2. Bulk & delta iterations where the solution set is the edge
> dataset:
> > > for
> > > > the cases where we need to update edge values.
> > > > 3. Bulk & delta iterations where the solution set is the Graph: this
> > will
> > > > cover more complex cases, where the algorithm updates both vertices
> and
> > > > edges or even adds/removes vertices/edges, i.e. updates the whole
> > Graph.
> > > >
> > > > What do you think? I can see 1 & 2 being very easy to implement, but
> I
> > > > suspect 3 won't be that easy (but so awesome to have ^^).
> > > > Would it work the way a Graph is represented now, i.e. with 2
> DataSets?
> > > >
> > > > Any comment, idea, pointer would be much appreciated! Thank you ^^
> > > >
> > > > Cheers,
> > > > -V.
> > > >
> > > > [1]:
> > > >
> > > >
> > >
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > > > [2]:
> > > >
> > > >
> > >
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> > > >
> > >
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Vasiliki Kalavri <va...@gmail.com>.

Hi,

yes, I was referring to the parallel Boruvka algorithm. There are several
ways to implement this one in Flink and I believe that the one described in
the paper (vertex-centric) is not the most elegant one :)

Andra is now working on an idea that uses the delta iteration abstraction
and we believe that it will be both more efficient and easier to
understand. It has the edges in the solution set and the vertices in the
workset, so it follows the pattern I describe in (2) in my previous e-mail.
As a next step, we would like to see how having an iteration operator that
could update the whole graph -what I describe as (3)- would make this even
nicer.

Any ideas are highly welcome!

Cheers,
V.

On 22 February 2015 at 16:32, Andra Lungu <lu...@gmail.com> wrote:

> Hi Alex,
>
> Vasia is talking about the second version(presented Friday) of Parallel
> Boruvka, which can be found here:
> https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
>
> I will propose the third, non-Pregel like approach directly to Gelly soon.
>
> If you have additional questions, I will be happy to answer them.
>
> Andra
>
> On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
> alexander.s.alexandrov@gmail.com> wrote:
>
> > Hi Vasia,
> >
> > I am trying to look at the problem in more detail. Which version of the
> MST
> > are you talking about?
> >
> > Right now in the Gelly repository I can only find the SSSP example
> > (parallel Bellman-Ford) from Section 4.2 in [1].
> >
> > However, it seems that the issues encountered by Andra are related to the
> > implementation of Parallel Boruvka (Section 3.2 in [2]). Is that correct?
> >
> > Regards,
> > A.
> >
> > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> >
> > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <va...@gmail.com>:
> >
> > > Hello beautiful Flink people,
> > >
> > > during the past few days, Andra and I have been discussing about how to
> > > extend Gelly's iteration methods.
> > >
> > > Alexander's course (and his awesome students) has made it obvious that
> > > vertex-centric iterations are not the best fit for algorithms which
> don't
> > > follow the common "propagate-update" pattern. For example, Andra is
> > working
> > > on an implementation of Minimum Spanning Tree, which requires branching
> > > inside an iteration and also requires a convergence check of an
> internal
> > > iteration. Others also reported similar issues [1, 2]. Trying to fit
> such
> > > algorithms to the vertex-centric model leads to long and ugly code,
> e.g.
> > > aggregators to keep track of algorithm phases, duplicating data, etc.
> > >
> > > One limitation of the vertex-centric and the upcoming GAS model is that
> > > they both only allow the vertex values to be updated in each iteration.
> > > However, for some algorithms we need to update the edge values and in
> > > others we need to update both. In even more complex situations (like
> > > Andra's MST) in some iterations we need to update the vertex values and
> > in
> > > some iterations we need to update the edge values.
> > > Another problem is that we currently don't have a way to allow
> different
> > > computational phases inside an iteration. This is something that Giraph
> > > solves with master compute, a function that is executed once before
> each
> > > superstep and sets the computation function.
> > >
> > > All that said, I believe that we can solve most of these issues if we
> > > nicely expose Flink's iteration operators in Gelly. I can see the
> > following
> > > cases:
> > >
> > > 1. Bulk & delta iterations where the solution set is the vertex
> dataset:
> > > this will be similar to vertex-centric and GAS, but will allow more
> > > flexible dataflows inside the iteration.
> > > 2. Bulk & delta iterations where the solution set is the edge dataset:
> > for
> > > the cases where we need to update edge values.
> > > 3. Bulk & delta iterations where the solution set is the Graph: this
> will
> > > cover more complex cases, where the algorithm updates both vertices and
> > > edges or even adds/removes vertices/edges, i.e. updates the whole
> Graph.
> > >
> > > What do you think? I can see 1 & 2 being very easy to implement, but I
> > > suspect 3 won't be that easy (but so awesome to have ^^).
> > > Would it work the way a Graph is represented now, i.e. with 2 DataSets?
> > >
> > > Any comment, idea, pointer would be much appreciated! Thank you ^^
> > >
> > > Cheers,
> > > -V.
> > >
> > > [1]:
> > >
> > >
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > > [2]:
> > >
> > >
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> > >
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Andra Lungu <lu...@gmail.com>.

Hi Alex,

Vasia is talking about the second version(presented Friday) of Parallel
Boruvka, which can be found here:
https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59

I will propose the third, non-Pregel like approach directly to Gelly soon.

If you have additional questions, I will be happy to answer them.

Andra

On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
alexander.s.alexandrov@gmail.com> wrote:

> Hi Vasia,
>
> I am trying to look at the problem in more detail. Which version of the MST
> are you talking about?
>
> Right now in the Gelly repository I can only find the SSSP example
> (parallel Bellman-Ford) from Section 4.2 in [1].
>
> However, it seems that the issues encountered by Andra are related to the
> implementation of Parallel Boruvka (Section 3.2 in [2]). Is that correct?
>
> Regards,
> A.
>
> [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
>
> 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <va...@gmail.com>:
>
> > Hello beautiful Flink people,
> >
> > during the past few days, Andra and I have been discussing about how to
> > extend Gelly's iteration methods.
> >
> > Alexander's course (and his awesome students) has made it obvious that
> > vertex-centric iterations are not the best fit for algorithms which don't
> > follow the common "propagate-update" pattern. For example, Andra is
> working
> > on an implementation of Minimum Spanning Tree, which requires branching
> > inside an iteration and also requires a convergence check of an internal
> > iteration. Others also reported similar issues [1, 2]. Trying to fit such
> > algorithms to the vertex-centric model leads to long and ugly code, e.g.
> > aggregators to keep track of algorithm phases, duplicating data, etc.
> >
> > One limitation of the vertex-centric and the upcoming GAS model is that
> > they both only allow the vertex values to be updated in each iteration.
> > However, for some algorithms we need to update the edge values and in
> > others we need to update both. In even more complex situations (like
> > Andra's MST) in some iterations we need to update the vertex values and
> in
> > some iterations we need to update the edge values.
> > Another problem is that we currently don't have a way to allow different
> > computational phases inside an iteration. This is something that Giraph
> > solves with master compute, a function that is executed once before each
> > superstep and sets the computation function.
> >
> > All that said, I believe that we can solve most of these issues if we
> > nicely expose Flink's iteration operators in Gelly. I can see the
> following
> > cases:
> >
> > 1. Bulk & delta iterations where the solution set is the vertex dataset:
> > this will be similar to vertex-centric and GAS, but will allow more
> > flexible dataflows inside the iteration.
> > 2. Bulk & delta iterations where the solution set is the edge dataset:
> for
> > the cases where we need to update edge values.
> > 3. Bulk & delta iterations where the solution set is the Graph: this will
> > cover more complex cases, where the algorithm updates both vertices and
> > edges or even adds/removes vertices/edges, i.e. updates the whole Graph.
> >
> > What do you think? I can see 1 & 2 being very easy to implement, but I
> > suspect 3 won't be that easy (but so awesome to have ^^).
> > Would it work the way a Graph is represented now, i.e. with 2 DataSets?
> >
> > Any comment, idea, pointer would be much appreciated! Thank you ^^
> >
> > Cheers,
> > -V.
> >
> > [1]:
> >
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > [2]:
> >
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Alexander Alexandrov <al...@gmail.com>.

Hi Vasia,

I am trying to look at the problem in more detail. Which version of the MST
are you talking about?

Right now in the Gelly repository I can only find the SSSP example
(parallel Bellman-Ford) from Section 4.2 in [1].

However, it seems that the issues encountered by Andra are related to the
implementation of Parallel Boruvka (Section 3.2 in [2]). Is that correct?

Regards,
A.

[1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
[2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf

2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <va...@gmail.com>:

> Hello beautiful Flink people,
>
> during the past few days, Andra and I have been discussing about how to
> extend Gelly's iteration methods.
>
> Alexander's course (and his awesome students) has made it obvious that
> vertex-centric iterations are not the best fit for algorithms which don't
> follow the common "propagate-update" pattern. For example, Andra is working
> on an implementation of Minimum Spanning Tree, which requires branching
> inside an iteration and also requires a convergence check of an internal
> iteration. Others also reported similar issues [1, 2]. Trying to fit such
> algorithms to the vertex-centric model leads to long and ugly code, e.g.
> aggregators to keep track of algorithm phases, duplicating data, etc.
>
> One limitation of the vertex-centric and the upcoming GAS model is that
> they both only allow the vertex values to be updated in each iteration.
> However, for some algorithms we need to update the edge values and in
> others we need to update both. In even more complex situations (like
> Andra's MST) in some iterations we need to update the vertex values and in
> some iterations we need to update the edge values.
> Another problem is that we currently don't have a way to allow different
> computational phases inside an iteration. This is something that Giraph
> solves with master compute, a function that is executed once before each
> superstep and sets the computation function.
>
> All that said, I believe that we can solve most of these issues if we
> nicely expose Flink's iteration operators in Gelly. I can see the following
> cases:
>
> 1. Bulk & delta iterations where the solution set is the vertex dataset:
> this will be similar to vertex-centric and GAS, but will allow more
> flexible dataflows inside the iteration.
> 2. Bulk & delta iterations where the solution set is the edge dataset: for
> the cases where we need to update edge values.
> 3. Bulk & delta iterations where the solution set is the Graph: this will
> cover more complex cases, where the algorithm updates both vertices and
> edges or even adds/removes vertices/edges, i.e. updates the whole Graph.
>
> What do you think? I can see 1 & 2 being very easy to implement, but I
> suspect 3 won't be that easy (but so awesome to have ^^).
> Would it work the way a Graph is represented now, i.e. with 2 DataSets?
>
> Any comment, idea, pointer would be much appreciated! Thank you ^^
>
> Cheers,
> -V.
>
> [1]:
>
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> [2]:
>
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
>