You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by Vasiliki Kalavri <va...@gmail.com> on 2015/03/30 22:33:28 UTC

Re: [DISCUSS] Gelly iteration abstractions

Hi,

I'm back looking into this and I tried out the for-loop approach that was
suggested above.

I implemented a simple algorithm, k-core, which computes the k-core of a
graph by iteratively filtering out vertices with degree less than k.
You can find the code in [1].

Unfortunately, this is giving me the following error, with an example graph
of 10 vertices:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
exceeded
at
org.apache.flink.optimizer.plan.PlanNode.mergeBranchPlanMaps(PlanNode.java:106)
at
org.apache.flink.optimizer.plan.PlanNode.mergeBranchPlanMaps(PlanNode.java:99)
at
org.apache.flink.optimizer.plan.DualInputPlanNode.<init>(DualInputPlanNode.java:90)
at
org.apache.flink.optimizer.plan.DualInputPlanNode.<init>(DualInputPlanNode.java:69)
at
org.apache.flink.optimizer.operators.HashJoinBuildSecondProperties.instantiate(HashJoinBuildSecondProperties.java:81)
at
org.apache.flink.optimizer.dag.TwoInputNode.instantiate(TwoInputNode.java:607)
at
org.apache.flink.optimizer.dag.TwoInputNode.addLocalCandidates(TwoInputNode.java:557)
at
org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:478)
at
org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:309)
at
org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:309)
at
org.apache.flink.optimizer.dag.SingleInputNode.getAlternativePlans(SingleInputNode.java:249)
at
org.apache.flink.optimizer.dag.SingleInputNode.getAlternativePlans(SingleInputNode.java:249)
at
org.apache.flink.optimizer.dag.DataSinkNode.getAlternativePlans(DataSinkNode.java:204)
at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:501)
at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:403)
at org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:162)
at
org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:51)
at
org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:797)
at org.apache.flink.api.java.DataSet.count(DataSet.java:397)
at org.apache.flink.graph.Graph.numberOfVertices(Graph.java:913)
at org.apache.flink.graph.example.KCoreExample.main(KCoreExample.java:91)

Basically, the first 3 iterations are executed normally, then the 4th gets
stuck and eventually the program fails with the above error message.
Any help / suggestion would be greatly appreciated ^^

Cheers,
-Vasia.

[1]:
https://github.com/vasia/flink/commit/62b065b69746e25b74bee84d210d5c5b4ee761bd

On 23 February 2015 at 18:05, Vasiliki Kalavri <va...@gmail.com>
wrote:

> I see, thanks a lot for the answers!
> To rephrase my original question, would it make sense to define a
> closed-loop iteration where the state is the whole graph?
>
> If you want to take a look at the current implementation of DMST using
> delta iteration, Andra has made a PR [1].
> On a high-level, this algorithm does (more or less) the following:
>
> while there are more than one vertices in the graph
>    for every vertex
>       select the min-weight edge
>       add all selected edges to the MST
>    collapse vertices with the same root into one vertex (iterative
> connected components-like step)
>
> What we are thinking is that we could express this with -maybe- a
> bulk-like iteration on the Graph, inside which we can use Gelly methods.
> Would it make sense or shall we go for a for-loop implementation instead?
>
> Thanks!
> -V.
>
> [1]: http://github.com/apache/flink/pull/434
>
> On 23 February 2015 at 16:18, Stephan Ewen <se...@apache.org> wrote:
>
>> Closed-loop iterations are much more efficient right now. Long for loops
>> suffer from memory fragmentation (an issue that is in the list to fix).
>>
>> Also, closed loops can be stateful (delta iterations) and do not require
>> task re-deployment.
>>
>> On Mon, Feb 23, 2015 at 4:15 PM, Vasiliki Kalavri <
>> vasilikikalavri@gmail.com
>> > wrote:
>>
>> > I see that's cool :-)
>> > So, what is the advantage of closed-loop versus for-loop iterations?
>> > Custom convergence criteria / aggregators and more efficient execution
>> > plans?
>> >
>> > On 23 February 2015 at 15:01, Stephan Ewen <se...@apache.org> wrote:
>> >
>> > > For loops are basically rolled out - they yield long execution plans.
>> > >
>> > > On Mon, Feb 23, 2015 at 2:44 PM, Vasiliki Kalavri <
>> > > vasilikikalavri@gmail.com
>> > > > wrote:
>> > >
>> > > > for-loop iterations could cover some cases, I guess, when the
>> number of
>> > > > iterations is known beforehand.
>> > > > Are there currently any restrictions on what can be used inside a
>> > > for-loop?
>> > > > How are they translated into execution plans?
>> > > >
>> > > > On 23 February 2015 at 13:08, Stephan Ewen <se...@apache.org>
>> wrote:
>> > > >
>> > > > > Some things may not work well as "closed-loop" iterations.
>> > > > >
>> > > > > Is it possible to express those as for-loop iterations?
>> > > > >
>> > > > > On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <
>> > > > > vasilikikalavri@gmail.com
>> > > > > > wrote:
>> > > > >
>> > > > > > Hi Stephan,
>> > > > > >
>> > > > > > yes, this would work for the cases where an algorithm only
>> updates
>> > > the
>> > > > > > vertex values or only updates the edge values.
>> > > > > >
>> > > > > > What we would like to also support is
>> > > > > > (a) algorithms where both vertices and edges are updated in one
>> > > > iteration
>> > > > > > (b) algorithms where the graph structure changes from one
>> iteration
>> > > to
>> > > > > the
>> > > > > > next and
>> > > > > > (c) branching inside an iteration, i.e. executing a different
>> > > > "iteration
>> > > > > > body" based on some condition.
>> > > > > >
>> > > > > > We can still currently implement those with regular Flink
>> iteration
>> > > > > > operators, but the resulting code is not that nice or efficient.
>> > > > > > For example, if we want to update both edges and vertex values,
>> we
>> > > can
>> > > > > > still create a solution set where the vertex values are
>> attached to
>> > > > each
>> > > > > > edge.
>> > > > > > Regarding branching inside an iteration, we can use an
>> aggregator
>> > > that
>> > > > > > tracks the iteration phase, but then we need to somehow make the
>> > > > > different
>> > > > > > phases to consist of the same operators and also check the
>> > branching
>> > > > > > condition inside each UDF.
>> > > > > >
>> > > > > > Cheers,
>> > > > > > V.
>> > > > > >
>> > > > > >
>> > > > > > On 23 February 2015 at 11:05, Stephan Ewen <se...@apache.org>
>> > wrote:
>> > > > > >
>> > > > > > > As a workaround, it should always work to get the Edge and
>> Vertex
>> > > > data
>> > > > > > set
>> > > > > > > from the graph and use the regular Fink iteration operators?
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
>> > > > > > > vasilikikalavri@gmail.com
>> > > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi,
>> > > > > > > >
>> > > > > > > > yes, I was referring to the parallel Boruvka algorithm.
>> There
>> > are
>> > > > > > several
>> > > > > > > > ways to implement this one in Flink and I believe that the
>> one
>> > > > > > described
>> > > > > > > in
>> > > > > > > > the paper (vertex-centric) is not the most elegant one :)
>> > > > > > > >
>> > > > > > > > Andra is now working on an idea that uses the delta
>> iteration
>> > > > > > abstraction
>> > > > > > > > and we believe that it will be both more efficient and
>> easier
>> > to
>> > > > > > > > understand. It has the edges in the solution set and the
>> > vertices
>> > > > in
>> > > > > > the
>> > > > > > > > workset, so it follows the pattern I describe in (2) in my
>> > > previous
>> > > > > > > e-mail.
>> > > > > > > > As a next step, we would like to see how having an iteration
>> > > > operator
>> > > > > > > that
>> > > > > > > > could update the whole graph -what I describe as (3)- would
>> > make
>> > > > this
>> > > > > > > even
>> > > > > > > > nicer.
>> > > > > > > >
>> > > > > > > > Any ideas are highly welcome!
>> > > > > > > >
>> > > > > > > > Cheers,
>> > > > > > > > V.
>> > > > > > > >
>> > > > > > > > On 22 February 2015 at 16:32, Andra Lungu <
>> > lungu.andra@gmail.com
>> > > >
>> > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi Alex,
>> > > > > > > > >
>> > > > > > > > > Vasia is talking about the second version(presented
>> Friday)
>> > of
>> > > > > > Parallel
>> > > > > > > > > Boruvka, which can be found here:
>> > > > > > > > > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
>> > > > > > > > >
>> > > > > > > > > I will propose the third, non-Pregel like approach
>> directly
>> > to
>> > > > > Gelly
>> > > > > > > > soon.
>> > > > > > > > >
>> > > > > > > > > If you have additional questions, I will be happy to
>> answer
>> > > them.
>> > > > > > > > >
>> > > > > > > > > Andra
>> > > > > > > > >
>> > > > > > > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
>> > > > > > > > > alexander.s.alexandrov@gmail.com> wrote:
>> > > > > > > > >
>> > > > > > > > > > Hi Vasia,
>> > > > > > > > > >
>> > > > > > > > > > I am trying to look at the problem in more detail. Which
>> > > > version
>> > > > > of
>> > > > > > > the
>> > > > > > > > > MST
>> > > > > > > > > > are you talking about?
>> > > > > > > > > >
>> > > > > > > > > > Right now in the Gelly repository I can only find the
>> SSSP
>> > > > > example
>> > > > > > > > > > (parallel Bellman-Ford) from Section 4.2 in [1].
>> > > > > > > > > >
>> > > > > > > > > > However, it seems that the issues encountered by Andra
>> are
>> > > > > related
>> > > > > > to
>> > > > > > > > the
>> > > > > > > > > > implementation of Parallel Boruvka (Section 3.2 in
>> [2]). Is
>> > > > that
>> > > > > > > > correct?
>> > > > > > > > > >
>> > > > > > > > > > Regards,
>> > > > > > > > > > A.
>> > > > > > > > > >
>> > > > > > > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
>> > > > > > > > > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
>> > > > > > > > > >
>> > > > > > > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
>> > > > > > > vasilikikalavri@gmail.com
>> > > > > > > > >:
>> > > > > > > > > >
>> > > > > > > > > > > Hello beautiful Flink people,
>> > > > > > > > > > >
>> > > > > > > > > > > during the past few days, Andra and I have been
>> > discussing
>> > > > > about
>> > > > > > > how
>> > > > > > > > to
>> > > > > > > > > > > extend Gelly's iteration methods.
>> > > > > > > > > > >
>> > > > > > > > > > > Alexander's course (and his awesome students) has
>> made it
>> > > > > obvious
>> > > > > > > > that
>> > > > > > > > > > > vertex-centric iterations are not the best fit for
>> > > algorithms
>> > > > > > which
>> > > > > > > > > don't
>> > > > > > > > > > > follow the common "propagate-update" pattern. For
>> > example,
>> > > > > Andra
>> > > > > > is
>> > > > > > > > > > working
>> > > > > > > > > > > on an implementation of Minimum Spanning Tree, which
>> > > requires
>> > > > > > > > branching
>> > > > > > > > > > > inside an iteration and also requires a convergence
>> check
>> > > of
>> > > > an
>> > > > > > > > > internal
>> > > > > > > > > > > iteration. Others also reported similar issues [1, 2].
>> > > Trying
>> > > > > to
>> > > > > > > fit
>> > > > > > > > > such
>> > > > > > > > > > > algorithms to the vertex-centric model leads to long
>> and
>> > > ugly
>> > > > > > code,
>> > > > > > > > > e.g.
>> > > > > > > > > > > aggregators to keep track of algorithm phases,
>> > duplicating
>> > > > > data,
>> > > > > > > etc.
>> > > > > > > > > > >
>> > > > > > > > > > > One limitation of the vertex-centric and the upcoming
>> GAS
>> > > > model
>> > > > > > is
>> > > > > > > > that
>> > > > > > > > > > > they both only allow the vertex values to be updated
>> in
>> > > each
>> > > > > > > > iteration.
>> > > > > > > > > > > However, for some algorithms we need to update the
>> edge
>> > > > values
>> > > > > > and
>> > > > > > > in
>> > > > > > > > > > > others we need to update both. In even more complex
>> > > > situations
>> > > > > > > (like
>> > > > > > > > > > > Andra's MST) in some iterations we need to update the
>> > > vertex
>> > > > > > values
>> > > > > > > > and
>> > > > > > > > > > in
>> > > > > > > > > > > some iterations we need to update the edge values.
>> > > > > > > > > > > Another problem is that we currently don't have a way
>> to
>> > > > allow
>> > > > > > > > > different
>> > > > > > > > > > > computational phases inside an iteration. This is
>> > something
>> > > > > that
>> > > > > > > > Giraph
>> > > > > > > > > > > solves with master compute, a function that is
>> executed
>> > > once
>> > > > > > before
>> > > > > > > > > each
>> > > > > > > > > > > superstep and sets the computation function.
>> > > > > > > > > > >
>> > > > > > > > > > > All that said, I believe that we can solve most of
>> these
>> > > > issues
>> > > > > > if
>> > > > > > > we
>> > > > > > > > > > > nicely expose Flink's iteration operators in Gelly. I
>> can
>> > > see
>> > > > > the
>> > > > > > > > > > following
>> > > > > > > > > > > cases:
>> > > > > > > > > > >
>> > > > > > > > > > > 1. Bulk & delta iterations where the solution set is
>> the
>> > > > vertex
>> > > > > > > > > dataset:
>> > > > > > > > > > > this will be similar to vertex-centric and GAS, but
>> will
>> > > > allow
>> > > > > > more
>> > > > > > > > > > > flexible dataflows inside the iteration.
>> > > > > > > > > > > 2. Bulk & delta iterations where the solution set is
>> the
>> > > edge
>> > > > > > > > dataset:
>> > > > > > > > > > for
>> > > > > > > > > > > the cases where we need to update edge values.
>> > > > > > > > > > > 3. Bulk & delta iterations where the solution set is
>> the
>> > > > Graph:
>> > > > > > > this
>> > > > > > > > > will
>> > > > > > > > > > > cover more complex cases, where the algorithm updates
>> > both
>> > > > > > vertices
>> > > > > > > > and
>> > > > > > > > > > > edges or even adds/removes vertices/edges, i.e.
>> updates
>> > the
>> > > > > whole
>> > > > > > > > > Graph.
>> > > > > > > > > > >
>> > > > > > > > > > > What do you think? I can see 1 & 2 being very easy to
>> > > > > implement,
>> > > > > > > but
>> > > > > > > > I
>> > > > > > > > > > > suspect 3 won't be that easy (but so awesome to have
>> ^^).
>> > > > > > > > > > > Would it work the way a Graph is represented now, i.e.
>> > > with 2
>> > > > > > > > DataSets?
>> > > > > > > > > > >
>> > > > > > > > > > > Any comment, idea, pointer would be much appreciated!
>> > Thank
>> > > > you
>> > > > > > ^^
>> > > > > > > > > > >
>> > > > > > > > > > > Cheers,
>> > > > > > > > > > > -V.
>> > > > > > > > > > >
>> > > > > > > > > > > [1]:
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
>> > > > > > > > > > > [2]:
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Vasiliki Kalavri <va...@gmail.com>.
Hey,

Wow, Vasia, a GC limit error in the Optimizer (pre-flight), not the runtime.
>
> That is cannot have been easy to produce ;-) The program does not really
> look that tricky, I need to take some closer look at this...
>

​It was pretty easy actually :P
The program is really simple: in each iteration, compute the vertex
degrees, filter out the ones that have degree less than K and emit the
resulting graph.


>
> BTW: Before this algorithm runs well, we really need pinning intermediate
> results in memory. Otherwise each step of the loop will execute its entire
> history every time again.


​Can you elaborate a bit on this? Is there some ongoing work on pinning
intermediate results in memory?
And can I maybe help somehow? :-)​


> A way to mitigate this for now is to manually
> break persist and re-read the intermediate graph.


​I wouldn't want to mitigate this issue in a hacky way, but rather have a
nice and clean solution to allow writing such iterative programs, where a
graph is transformed inside an iteration using Gelly methods and then fed
as input to the​ next iteration.


> The ML library does that
> in some points to break programs into shorter, more feasible portions.
>

​Could you point me to an example where this is ​happening?


>
> Also: Do you cache the vertex count in the graph? May be worth doing,
> otherwise all counts are computed twice (for current graph and for the
> previous graph)
>
>
​That's a valid point, I can cache the previous count. But even without
computing the degrees and just running the loop for a f​ixed number of
iterations results in the same error :/



> Stephan
>
>
​Thanks a lot!

-V.​



>
> On Mon, Mar 30, 2015 at 10:33 PM, Vasiliki Kalavri <
> vasilikikalavri@gmail.com> wrote:
>
> > Hi,
> >
> > I'm back looking into this and I tried out the for-loop approach that was
> > suggested above.
> >
> > I implemented a simple algorithm, k-core, which computes the k-core of a
> > graph by iteratively filtering out vertices with degree less than k.
> > You can find the code in [1].
> >
> > Unfortunately, this is giving me the following error, with an example
> graph
> > of 10 vertices:
> >
> > Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
> > exceeded
> > at
> >
> >
> org.apache.flink.optimizer.plan.PlanNode.mergeBranchPlanMaps(PlanNode.java:106)
> > at
> >
> >
> org.apache.flink.optimizer.plan.PlanNode.mergeBranchPlanMaps(PlanNode.java:99)
> > at
> >
> >
> org.apache.flink.optimizer.plan.DualInputPlanNode.<init>(DualInputPlanNode.java:90)
> > at
> >
> >
> org.apache.flink.optimizer.plan.DualInputPlanNode.<init>(DualInputPlanNode.java:69)
> > at
> >
> >
> org.apache.flink.optimizer.operators.HashJoinBuildSecondProperties.instantiate(HashJoinBuildSecondProperties.java:81)
> > at
> >
> >
> org.apache.flink.optimizer.dag.TwoInputNode.instantiate(TwoInputNode.java:607)
> > at
> >
> >
> org.apache.flink.optimizer.dag.TwoInputNode.addLocalCandidates(TwoInputNode.java:557)
> > at
> >
> >
> org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:478)
> > at
> >
> >
> org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:309)
> > at
> >
> >
> org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:309)
> > at
> >
> >
> org.apache.flink.optimizer.dag.SingleInputNode.getAlternativePlans(SingleInputNode.java:249)
> > at
> >
> >
> org.apache.flink.optimizer.dag.SingleInputNode.getAlternativePlans(SingleInputNode.java:249)
> > at
> >
> >
> org.apache.flink.optimizer.dag.DataSinkNode.getAlternativePlans(DataSinkNode.java:204)
> > at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:501)
> > at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:403)
> > at
> > org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:162)
> > at
> >
> >
> org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:51)
> > at
> >
> >
> org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:797)
> > at org.apache.flink.api.java.DataSet.count(DataSet.java:397)
> > at org.apache.flink.graph.Graph.numberOfVertices(Graph.java:913)
> > at org.apache.flink.graph.example.KCoreExample.main(KCoreExample.java:91)
> >
> > Basically, the first 3 iterations are executed normally, then the 4th
> gets
> > stuck and eventually the program fails with the above error message.
> > Any help / suggestion would be greatly appreciated ^^
> >
> > Cheers,
> > -Vasia.
> >
> > [1]:
> >
> >
> https://github.com/vasia/flink/commit/62b065b69746e25b74bee84d210d5c5b4ee761bd
> >
> > On 23 February 2015 at 18:05, Vasiliki Kalavri <
> vasilikikalavri@gmail.com>
> > wrote:
> >
> > > I see, thanks a lot for the answers!
> > > To rephrase my original question, would it make sense to define a
> > > closed-loop iteration where the state is the whole graph?
> > >
> > > If you want to take a look at the current implementation of DMST using
> > > delta iteration, Andra has made a PR [1].
> > > On a high-level, this algorithm does (more or less) the following:
> > >
> > > while there are more than one vertices in the graph
> > >    for every vertex
> > >       select the min-weight edge
> > >       add all selected edges to the MST
> > >    collapse vertices with the same root into one vertex (iterative
> > > connected components-like step)
> > >
> > > What we are thinking is that we could express this with -maybe- a
> > > bulk-like iteration on the Graph, inside which we can use Gelly
> methods.
> > > Would it make sense or shall we go for a for-loop implementation
> instead?
> > >
> > > Thanks!
> > > -V.
> > >
> > > [1]: http://github.com/apache/flink/pull/434
> > >
> > > On 23 February 2015 at 16:18, Stephan Ewen <se...@apache.org> wrote:
> > >
> > >> Closed-loop iterations are much more efficient right now. Long for
> loops
> > >> suffer from memory fragmentation (an issue that is in the list to
> fix).
> > >>
> > >> Also, closed loops can be stateful (delta iterations) and do not
> require
> > >> task re-deployment.
> > >>
> > >> On Mon, Feb 23, 2015 at 4:15 PM, Vasiliki Kalavri <
> > >> vasilikikalavri@gmail.com
> > >> > wrote:
> > >>
> > >> > I see that's cool :-)
> > >> > So, what is the advantage of closed-loop versus for-loop iterations?
> > >> > Custom convergence criteria / aggregators and more efficient
> execution
> > >> > plans?
> > >> >
> > >> > On 23 February 2015 at 15:01, Stephan Ewen <se...@apache.org>
> wrote:
> > >> >
> > >> > > For loops are basically rolled out - they yield long execution
> > plans.
> > >> > >
> > >> > > On Mon, Feb 23, 2015 at 2:44 PM, Vasiliki Kalavri <
> > >> > > vasilikikalavri@gmail.com
> > >> > > > wrote:
> > >> > >
> > >> > > > for-loop iterations could cover some cases, I guess, when the
> > >> number of
> > >> > > > iterations is known beforehand.
> > >> > > > Are there currently any restrictions on what can be used inside
> a
> > >> > > for-loop?
> > >> > > > How are they translated into execution plans?
> > >> > > >
> > >> > > > On 23 February 2015 at 13:08, Stephan Ewen <se...@apache.org>
> > >> wrote:
> > >> > > >
> > >> > > > > Some things may not work well as "closed-loop" iterations.
> > >> > > > >
> > >> > > > > Is it possible to express those as for-loop iterations?
> > >> > > > >
> > >> > > > > On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <
> > >> > > > > vasilikikalavri@gmail.com
> > >> > > > > > wrote:
> > >> > > > >
> > >> > > > > > Hi Stephan,
> > >> > > > > >
> > >> > > > > > yes, this would work for the cases where an algorithm only
> > >> updates
> > >> > > the
> > >> > > > > > vertex values or only updates the edge values.
> > >> > > > > >
> > >> > > > > > What we would like to also support is
> > >> > > > > > (a) algorithms where both vertices and edges are updated in
> > one
> > >> > > > iteration
> > >> > > > > > (b) algorithms where the graph structure changes from one
> > >> iteration
> > >> > > to
> > >> > > > > the
> > >> > > > > > next and
> > >> > > > > > (c) branching inside an iteration, i.e. executing a
> different
> > >> > > > "iteration
> > >> > > > > > body" based on some condition.
> > >> > > > > >
> > >> > > > > > We can still currently implement those with regular Flink
> > >> iteration
> > >> > > > > > operators, but the resulting code is not that nice or
> > efficient.
> > >> > > > > > For example, if we want to update both edges and vertex
> > values,
> > >> we
> > >> > > can
> > >> > > > > > still create a solution set where the vertex values are
> > >> attached to
> > >> > > > each
> > >> > > > > > edge.
> > >> > > > > > Regarding branching inside an iteration, we can use an
> > >> aggregator
> > >> > > that
> > >> > > > > > tracks the iteration phase, but then we need to somehow make
> > the
> > >> > > > > different
> > >> > > > > > phases to consist of the same operators and also check the
> > >> > branching
> > >> > > > > > condition inside each UDF.
> > >> > > > > >
> > >> > > > > > Cheers,
> > >> > > > > > V.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On 23 February 2015 at 11:05, Stephan Ewen <
> sewen@apache.org>
> > >> > wrote:
> > >> > > > > >
> > >> > > > > > > As a workaround, it should always work to get the Edge and
> > >> Vertex
> > >> > > > data
> > >> > > > > > set
> > >> > > > > > > from the graph and use the regular Fink iteration
> operators?
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
> > >> > > > > > > vasilikikalavri@gmail.com
> > >> > > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi,
> > >> > > > > > > >
> > >> > > > > > > > yes, I was referring to the parallel Boruvka algorithm.
> > >> There
> > >> > are
> > >> > > > > > several
> > >> > > > > > > > ways to implement this one in Flink and I believe that
> the
> > >> one
> > >> > > > > > described
> > >> > > > > > > in
> > >> > > > > > > > the paper (vertex-centric) is not the most elegant one
> :)
> > >> > > > > > > >
> > >> > > > > > > > Andra is now working on an idea that uses the delta
> > >> iteration
> > >> > > > > > abstraction
> > >> > > > > > > > and we believe that it will be both more efficient and
> > >> easier
> > >> > to
> > >> > > > > > > > understand. It has the edges in the solution set and the
> > >> > vertices
> > >> > > > in
> > >> > > > > > the
> > >> > > > > > > > workset, so it follows the pattern I describe in (2) in
> my
> > >> > > previous
> > >> > > > > > > e-mail.
> > >> > > > > > > > As a next step, we would like to see how having an
> > iteration
> > >> > > > operator
> > >> > > > > > > that
> > >> > > > > > > > could update the whole graph -what I describe as (3)-
> > would
> > >> > make
> > >> > > > this
> > >> > > > > > > even
> > >> > > > > > > > nicer.
> > >> > > > > > > >
> > >> > > > > > > > Any ideas are highly welcome!
> > >> > > > > > > >
> > >> > > > > > > > Cheers,
> > >> > > > > > > > V.
> > >> > > > > > > >
> > >> > > > > > > > On 22 February 2015 at 16:32, Andra Lungu <
> > >> > lungu.andra@gmail.com
> > >> > > >
> > >> > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Hi Alex,
> > >> > > > > > > > >
> > >> > > > > > > > > Vasia is talking about the second version(presented
> > >> Friday)
> > >> > of
> > >> > > > > > Parallel
> > >> > > > > > > > > Boruvka, which can be found here:
> > >> > > > > > > > >
> https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> > >> > > > > > > > >
> > >> > > > > > > > > I will propose the third, non-Pregel like approach
> > >> directly
> > >> > to
> > >> > > > > Gelly
> > >> > > > > > > > soon.
> > >> > > > > > > > >
> > >> > > > > > > > > If you have additional questions, I will be happy to
> > >> answer
> > >> > > them.
> > >> > > > > > > > >
> > >> > > > > > > > > Andra
> > >> > > > > > > > >
> > >> > > > > > > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov
> <
> > >> > > > > > > > > alexander.s.alexandrov@gmail.com> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > Hi Vasia,
> > >> > > > > > > > > >
> > >> > > > > > > > > > I am trying to look at the problem in more detail.
> > Which
> > >> > > > version
> > >> > > > > of
> > >> > > > > > > the
> > >> > > > > > > > > MST
> > >> > > > > > > > > > are you talking about?
> > >> > > > > > > > > >
> > >> > > > > > > > > > Right now in the Gelly repository I can only find
> the
> > >> SSSP
> > >> > > > > example
> > >> > > > > > > > > > (parallel Bellman-Ford) from Section 4.2 in [1].
> > >> > > > > > > > > >
> > >> > > > > > > > > > However, it seems that the issues encountered by
> Andra
> > >> are
> > >> > > > > related
> > >> > > > > > to
> > >> > > > > > > > the
> > >> > > > > > > > > > implementation of Parallel Boruvka (Section 3.2 in
> > >> [2]). Is
> > >> > > > that
> > >> > > > > > > > correct?
> > >> > > > > > > > > >
> > >> > > > > > > > > > Regards,
> > >> > > > > > > > > > A.
> > >> > > > > > > > > >
> > >> > > > > > > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> > >> > > > > > > > > > [2]
> http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> > >> > > > > > > > > >
> > >> > > > > > > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
> > >> > > > > > > vasilikikalavri@gmail.com
> > >> > > > > > > > >:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > Hello beautiful Flink people,
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > during the past few days, Andra and I have been
> > >> > discussing
> > >> > > > > about
> > >> > > > > > > how
> > >> > > > > > > > to
> > >> > > > > > > > > > > extend Gelly's iteration methods.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Alexander's course (and his awesome students) has
> > >> made it
> > >> > > > > obvious
> > >> > > > > > > > that
> > >> > > > > > > > > > > vertex-centric iterations are not the best fit for
> > >> > > algorithms
> > >> > > > > > which
> > >> > > > > > > > > don't
> > >> > > > > > > > > > > follow the common "propagate-update" pattern. For
> > >> > example,
> > >> > > > > Andra
> > >> > > > > > is
> > >> > > > > > > > > > working
> > >> > > > > > > > > > > on an implementation of Minimum Spanning Tree,
> which
> > >> > > requires
> > >> > > > > > > > branching
> > >> > > > > > > > > > > inside an iteration and also requires a
> convergence
> > >> check
> > >> > > of
> > >> > > > an
> > >> > > > > > > > > internal
> > >> > > > > > > > > > > iteration. Others also reported similar issues [1,
> > 2].
> > >> > > Trying
> > >> > > > > to
> > >> > > > > > > fit
> > >> > > > > > > > > such
> > >> > > > > > > > > > > algorithms to the vertex-centric model leads to
> long
> > >> and
> > >> > > ugly
> > >> > > > > > code,
> > >> > > > > > > > > e.g.
> > >> > > > > > > > > > > aggregators to keep track of algorithm phases,
> > >> > duplicating
> > >> > > > > data,
> > >> > > > > > > etc.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > One limitation of the vertex-centric and the
> > upcoming
> > >> GAS
> > >> > > > model
> > >> > > > > > is
> > >> > > > > > > > that
> > >> > > > > > > > > > > they both only allow the vertex values to be
> updated
> > >> in
> > >> > > each
> > >> > > > > > > > iteration.
> > >> > > > > > > > > > > However, for some algorithms we need to update the
> > >> edge
> > >> > > > values
> > >> > > > > > and
> > >> > > > > > > in
> > >> > > > > > > > > > > others we need to update both. In even more
> complex
> > >> > > > situations
> > >> > > > > > > (like
> > >> > > > > > > > > > > Andra's MST) in some iterations we need to update
> > the
> > >> > > vertex
> > >> > > > > > values
> > >> > > > > > > > and
> > >> > > > > > > > > > in
> > >> > > > > > > > > > > some iterations we need to update the edge values.
> > >> > > > > > > > > > > Another problem is that we currently don't have a
> > way
> > >> to
> > >> > > > allow
> > >> > > > > > > > > different
> > >> > > > > > > > > > > computational phases inside an iteration. This is
> > >> > something
> > >> > > > > that
> > >> > > > > > > > Giraph
> > >> > > > > > > > > > > solves with master compute, a function that is
> > >> executed
> > >> > > once
> > >> > > > > > before
> > >> > > > > > > > > each
> > >> > > > > > > > > > > superstep and sets the computation function.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > All that said, I believe that we can solve most of
> > >> these
> > >> > > > issues
> > >> > > > > > if
> > >> > > > > > > we
> > >> > > > > > > > > > > nicely expose Flink's iteration operators in
> Gelly.
> > I
> > >> can
> > >> > > see
> > >> > > > > the
> > >> > > > > > > > > > following
> > >> > > > > > > > > > > cases:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > 1. Bulk & delta iterations where the solution set
> is
> > >> the
> > >> > > > vertex
> > >> > > > > > > > > dataset:
> > >> > > > > > > > > > > this will be similar to vertex-centric and GAS,
> but
> > >> will
> > >> > > > allow
> > >> > > > > > more
> > >> > > > > > > > > > > flexible dataflows inside the iteration.
> > >> > > > > > > > > > > 2. Bulk & delta iterations where the solution set
> is
> > >> the
> > >> > > edge
> > >> > > > > > > > dataset:
> > >> > > > > > > > > > for
> > >> > > > > > > > > > > the cases where we need to update edge values.
> > >> > > > > > > > > > > 3. Bulk & delta iterations where the solution set
> is
> > >> the
> > >> > > > Graph:
> > >> > > > > > > this
> > >> > > > > > > > > will
> > >> > > > > > > > > > > cover more complex cases, where the algorithm
> > updates
> > >> > both
> > >> > > > > > vertices
> > >> > > > > > > > and
> > >> > > > > > > > > > > edges or even adds/removes vertices/edges, i.e.
> > >> updates
> > >> > the
> > >> > > > > whole
> > >> > > > > > > > > Graph.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > What do you think? I can see 1 & 2 being very easy
> > to
> > >> > > > > implement,
> > >> > > > > > > but
> > >> > > > > > > > I
> > >> > > > > > > > > > > suspect 3 won't be that easy (but so awesome to
> have
> > >> ^^).
> > >> > > > > > > > > > > Would it work the way a Graph is represented now,
> > i.e.
> > >> > > with 2
> > >> > > > > > > > DataSets?
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Any comment, idea, pointer would be much
> > appreciated!
> > >> > Thank
> > >> > > > you
> > >> > > > > > ^^
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Cheers,
> > >> > > > > > > > > > > -V.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > [1]:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > >> > > > > > > > > > > [2]:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: [DISCUSS] Gelly iteration abstractions

Posted by Stephan Ewen <se...@apache.org>.
Wow, Vasia, a GC limit error in the Optimizer (pre-flight), not the runtime.

That is cannot have been easy to produce ;-) The program does not really
look that tricky, I need to take some closer look at this...

BTW: Before this algorithm runs well, we really need pinning intermediate
results in memory. Otherwise each step of the loop will execute its entire
history every time again. A way to mitigate this for now is to manually
break persist and re-read the intermediate graph. The ML library does that
in some points to break programs into shorter, more feasible portions.

Also: Do you cache the vertex count in the graph? May be worth doing,
otherwise all counts are computed twice (for current graph and for the
previous graph)

Stephan


On Mon, Mar 30, 2015 at 10:33 PM, Vasiliki Kalavri <
vasilikikalavri@gmail.com> wrote:

> Hi,
>
> I'm back looking into this and I tried out the for-loop approach that was
> suggested above.
>
> I implemented a simple algorithm, k-core, which computes the k-core of a
> graph by iteratively filtering out vertices with degree less than k.
> You can find the code in [1].
>
> Unfortunately, this is giving me the following error, with an example graph
> of 10 vertices:
>
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
> exceeded
> at
>
> org.apache.flink.optimizer.plan.PlanNode.mergeBranchPlanMaps(PlanNode.java:106)
> at
>
> org.apache.flink.optimizer.plan.PlanNode.mergeBranchPlanMaps(PlanNode.java:99)
> at
>
> org.apache.flink.optimizer.plan.DualInputPlanNode.<init>(DualInputPlanNode.java:90)
> at
>
> org.apache.flink.optimizer.plan.DualInputPlanNode.<init>(DualInputPlanNode.java:69)
> at
>
> org.apache.flink.optimizer.operators.HashJoinBuildSecondProperties.instantiate(HashJoinBuildSecondProperties.java:81)
> at
>
> org.apache.flink.optimizer.dag.TwoInputNode.instantiate(TwoInputNode.java:607)
> at
>
> org.apache.flink.optimizer.dag.TwoInputNode.addLocalCandidates(TwoInputNode.java:557)
> at
>
> org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:478)
> at
>
> org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:309)
> at
>
> org.apache.flink.optimizer.dag.TwoInputNode.getAlternativePlans(TwoInputNode.java:309)
> at
>
> org.apache.flink.optimizer.dag.SingleInputNode.getAlternativePlans(SingleInputNode.java:249)
> at
>
> org.apache.flink.optimizer.dag.SingleInputNode.getAlternativePlans(SingleInputNode.java:249)
> at
>
> org.apache.flink.optimizer.dag.DataSinkNode.getAlternativePlans(DataSinkNode.java:204)
> at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:501)
> at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:403)
> at
> org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:162)
> at
>
> org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:51)
> at
>
> org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:797)
> at org.apache.flink.api.java.DataSet.count(DataSet.java:397)
> at org.apache.flink.graph.Graph.numberOfVertices(Graph.java:913)
> at org.apache.flink.graph.example.KCoreExample.main(KCoreExample.java:91)
>
> Basically, the first 3 iterations are executed normally, then the 4th gets
> stuck and eventually the program fails with the above error message.
> Any help / suggestion would be greatly appreciated ^^
>
> Cheers,
> -Vasia.
>
> [1]:
>
> https://github.com/vasia/flink/commit/62b065b69746e25b74bee84d210d5c5b4ee761bd
>
> On 23 February 2015 at 18:05, Vasiliki Kalavri <va...@gmail.com>
> wrote:
>
> > I see, thanks a lot for the answers!
> > To rephrase my original question, would it make sense to define a
> > closed-loop iteration where the state is the whole graph?
> >
> > If you want to take a look at the current implementation of DMST using
> > delta iteration, Andra has made a PR [1].
> > On a high-level, this algorithm does (more or less) the following:
> >
> > while there are more than one vertices in the graph
> >    for every vertex
> >       select the min-weight edge
> >       add all selected edges to the MST
> >    collapse vertices with the same root into one vertex (iterative
> > connected components-like step)
> >
> > What we are thinking is that we could express this with -maybe- a
> > bulk-like iteration on the Graph, inside which we can use Gelly methods.
> > Would it make sense or shall we go for a for-loop implementation instead?
> >
> > Thanks!
> > -V.
> >
> > [1]: http://github.com/apache/flink/pull/434
> >
> > On 23 February 2015 at 16:18, Stephan Ewen <se...@apache.org> wrote:
> >
> >> Closed-loop iterations are much more efficient right now. Long for loops
> >> suffer from memory fragmentation (an issue that is in the list to fix).
> >>
> >> Also, closed loops can be stateful (delta iterations) and do not require
> >> task re-deployment.
> >>
> >> On Mon, Feb 23, 2015 at 4:15 PM, Vasiliki Kalavri <
> >> vasilikikalavri@gmail.com
> >> > wrote:
> >>
> >> > I see that's cool :-)
> >> > So, what is the advantage of closed-loop versus for-loop iterations?
> >> > Custom convergence criteria / aggregators and more efficient execution
> >> > plans?
> >> >
> >> > On 23 February 2015 at 15:01, Stephan Ewen <se...@apache.org> wrote:
> >> >
> >> > > For loops are basically rolled out - they yield long execution
> plans.
> >> > >
> >> > > On Mon, Feb 23, 2015 at 2:44 PM, Vasiliki Kalavri <
> >> > > vasilikikalavri@gmail.com
> >> > > > wrote:
> >> > >
> >> > > > for-loop iterations could cover some cases, I guess, when the
> >> number of
> >> > > > iterations is known beforehand.
> >> > > > Are there currently any restrictions on what can be used inside a
> >> > > for-loop?
> >> > > > How are they translated into execution plans?
> >> > > >
> >> > > > On 23 February 2015 at 13:08, Stephan Ewen <se...@apache.org>
> >> wrote:
> >> > > >
> >> > > > > Some things may not work well as "closed-loop" iterations.
> >> > > > >
> >> > > > > Is it possible to express those as for-loop iterations?
> >> > > > >
> >> > > > > On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <
> >> > > > > vasilikikalavri@gmail.com
> >> > > > > > wrote:
> >> > > > >
> >> > > > > > Hi Stephan,
> >> > > > > >
> >> > > > > > yes, this would work for the cases where an algorithm only
> >> updates
> >> > > the
> >> > > > > > vertex values or only updates the edge values.
> >> > > > > >
> >> > > > > > What we would like to also support is
> >> > > > > > (a) algorithms where both vertices and edges are updated in
> one
> >> > > > iteration
> >> > > > > > (b) algorithms where the graph structure changes from one
> >> iteration
> >> > > to
> >> > > > > the
> >> > > > > > next and
> >> > > > > > (c) branching inside an iteration, i.e. executing a different
> >> > > > "iteration
> >> > > > > > body" based on some condition.
> >> > > > > >
> >> > > > > > We can still currently implement those with regular Flink
> >> iteration
> >> > > > > > operators, but the resulting code is not that nice or
> efficient.
> >> > > > > > For example, if we want to update both edges and vertex
> values,
> >> we
> >> > > can
> >> > > > > > still create a solution set where the vertex values are
> >> attached to
> >> > > > each
> >> > > > > > edge.
> >> > > > > > Regarding branching inside an iteration, we can use an
> >> aggregator
> >> > > that
> >> > > > > > tracks the iteration phase, but then we need to somehow make
> the
> >> > > > > different
> >> > > > > > phases to consist of the same operators and also check the
> >> > branching
> >> > > > > > condition inside each UDF.
> >> > > > > >
> >> > > > > > Cheers,
> >> > > > > > V.
> >> > > > > >
> >> > > > > >
> >> > > > > > On 23 February 2015 at 11:05, Stephan Ewen <se...@apache.org>
> >> > wrote:
> >> > > > > >
> >> > > > > > > As a workaround, it should always work to get the Edge and
> >> Vertex
> >> > > > data
> >> > > > > > set
> >> > > > > > > from the graph and use the regular Fink iteration operators?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
> >> > > > > > > vasilikikalavri@gmail.com
> >> > > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hi,
> >> > > > > > > >
> >> > > > > > > > yes, I was referring to the parallel Boruvka algorithm.
> >> There
> >> > are
> >> > > > > > several
> >> > > > > > > > ways to implement this one in Flink and I believe that the
> >> one
> >> > > > > > described
> >> > > > > > > in
> >> > > > > > > > the paper (vertex-centric) is not the most elegant one :)
> >> > > > > > > >
> >> > > > > > > > Andra is now working on an idea that uses the delta
> >> iteration
> >> > > > > > abstraction
> >> > > > > > > > and we believe that it will be both more efficient and
> >> easier
> >> > to
> >> > > > > > > > understand. It has the edges in the solution set and the
> >> > vertices
> >> > > > in
> >> > > > > > the
> >> > > > > > > > workset, so it follows the pattern I describe in (2) in my
> >> > > previous
> >> > > > > > > e-mail.
> >> > > > > > > > As a next step, we would like to see how having an
> iteration
> >> > > > operator
> >> > > > > > > that
> >> > > > > > > > could update the whole graph -what I describe as (3)-
> would
> >> > make
> >> > > > this
> >> > > > > > > even
> >> > > > > > > > nicer.
> >> > > > > > > >
> >> > > > > > > > Any ideas are highly welcome!
> >> > > > > > > >
> >> > > > > > > > Cheers,
> >> > > > > > > > V.
> >> > > > > > > >
> >> > > > > > > > On 22 February 2015 at 16:32, Andra Lungu <
> >> > lungu.andra@gmail.com
> >> > > >
> >> > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Hi Alex,
> >> > > > > > > > >
> >> > > > > > > > > Vasia is talking about the second version(presented
> >> Friday)
> >> > of
> >> > > > > > Parallel
> >> > > > > > > > > Boruvka, which can be found here:
> >> > > > > > > > > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> >> > > > > > > > >
> >> > > > > > > > > I will propose the third, non-Pregel like approach
> >> directly
> >> > to
> >> > > > > Gelly
> >> > > > > > > > soon.
> >> > > > > > > > >
> >> > > > > > > > > If you have additional questions, I will be happy to
> >> answer
> >> > > them.
> >> > > > > > > > >
> >> > > > > > > > > Andra
> >> > > > > > > > >
> >> > > > > > > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov <
> >> > > > > > > > > alexander.s.alexandrov@gmail.com> wrote:
> >> > > > > > > > >
> >> > > > > > > > > > Hi Vasia,
> >> > > > > > > > > >
> >> > > > > > > > > > I am trying to look at the problem in more detail.
> Which
> >> > > > version
> >> > > > > of
> >> > > > > > > the
> >> > > > > > > > > MST
> >> > > > > > > > > > are you talking about?
> >> > > > > > > > > >
> >> > > > > > > > > > Right now in the Gelly repository I can only find the
> >> SSSP
> >> > > > > example
> >> > > > > > > > > > (parallel Bellman-Ford) from Section 4.2 in [1].
> >> > > > > > > > > >
> >> > > > > > > > > > However, it seems that the issues encountered by Andra
> >> are
> >> > > > > related
> >> > > > > > to
> >> > > > > > > > the
> >> > > > > > > > > > implementation of Parallel Boruvka (Section 3.2 in
> >> [2]). Is
> >> > > > that
> >> > > > > > > > correct?
> >> > > > > > > > > >
> >> > > > > > > > > > Regards,
> >> > > > > > > > > > A.
> >> > > > > > > > > >
> >> > > > > > > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> >> > > > > > > > > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> >> > > > > > > > > >
> >> > > > > > > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri <
> >> > > > > > > vasilikikalavri@gmail.com
> >> > > > > > > > >:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hello beautiful Flink people,
> >> > > > > > > > > > >
> >> > > > > > > > > > > during the past few days, Andra and I have been
> >> > discussing
> >> > > > > about
> >> > > > > > > how
> >> > > > > > > > to
> >> > > > > > > > > > > extend Gelly's iteration methods.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Alexander's course (and his awesome students) has
> >> made it
> >> > > > > obvious
> >> > > > > > > > that
> >> > > > > > > > > > > vertex-centric iterations are not the best fit for
> >> > > algorithms
> >> > > > > > which
> >> > > > > > > > > don't
> >> > > > > > > > > > > follow the common "propagate-update" pattern. For
> >> > example,
> >> > > > > Andra
> >> > > > > > is
> >> > > > > > > > > > working
> >> > > > > > > > > > > on an implementation of Minimum Spanning Tree, which
> >> > > requires
> >> > > > > > > > branching
> >> > > > > > > > > > > inside an iteration and also requires a convergence
> >> check
> >> > > of
> >> > > > an
> >> > > > > > > > > internal
> >> > > > > > > > > > > iteration. Others also reported similar issues [1,
> 2].
> >> > > Trying
> >> > > > > to
> >> > > > > > > fit
> >> > > > > > > > > such
> >> > > > > > > > > > > algorithms to the vertex-centric model leads to long
> >> and
> >> > > ugly
> >> > > > > > code,
> >> > > > > > > > > e.g.
> >> > > > > > > > > > > aggregators to keep track of algorithm phases,
> >> > duplicating
> >> > > > > data,
> >> > > > > > > etc.
> >> > > > > > > > > > >
> >> > > > > > > > > > > One limitation of the vertex-centric and the
> upcoming
> >> GAS
> >> > > > model
> >> > > > > > is
> >> > > > > > > > that
> >> > > > > > > > > > > they both only allow the vertex values to be updated
> >> in
> >> > > each
> >> > > > > > > > iteration.
> >> > > > > > > > > > > However, for some algorithms we need to update the
> >> edge
> >> > > > values
> >> > > > > > and
> >> > > > > > > in
> >> > > > > > > > > > > others we need to update both. In even more complex
> >> > > > situations
> >> > > > > > > (like
> >> > > > > > > > > > > Andra's MST) in some iterations we need to update
> the
> >> > > vertex
> >> > > > > > values
> >> > > > > > > > and
> >> > > > > > > > > > in
> >> > > > > > > > > > > some iterations we need to update the edge values.
> >> > > > > > > > > > > Another problem is that we currently don't have a
> way
> >> to
> >> > > > allow
> >> > > > > > > > > different
> >> > > > > > > > > > > computational phases inside an iteration. This is
> >> > something
> >> > > > > that
> >> > > > > > > > Giraph
> >> > > > > > > > > > > solves with master compute, a function that is
> >> executed
> >> > > once
> >> > > > > > before
> >> > > > > > > > > each
> >> > > > > > > > > > > superstep and sets the computation function.
> >> > > > > > > > > > >
> >> > > > > > > > > > > All that said, I believe that we can solve most of
> >> these
> >> > > > issues
> >> > > > > > if
> >> > > > > > > we
> >> > > > > > > > > > > nicely expose Flink's iteration operators in Gelly.
> I
> >> can
> >> > > see
> >> > > > > the
> >> > > > > > > > > > following
> >> > > > > > > > > > > cases:
> >> > > > > > > > > > >
> >> > > > > > > > > > > 1. Bulk & delta iterations where the solution set is
> >> the
> >> > > > vertex
> >> > > > > > > > > dataset:
> >> > > > > > > > > > > this will be similar to vertex-centric and GAS, but
> >> will
> >> > > > allow
> >> > > > > > more
> >> > > > > > > > > > > flexible dataflows inside the iteration.
> >> > > > > > > > > > > 2. Bulk & delta iterations where the solution set is
> >> the
> >> > > edge
> >> > > > > > > > dataset:
> >> > > > > > > > > > for
> >> > > > > > > > > > > the cases where we need to update edge values.
> >> > > > > > > > > > > 3. Bulk & delta iterations where the solution set is
> >> the
> >> > > > Graph:
> >> > > > > > > this
> >> > > > > > > > > will
> >> > > > > > > > > > > cover more complex cases, where the algorithm
> updates
> >> > both
> >> > > > > > vertices
> >> > > > > > > > and
> >> > > > > > > > > > > edges or even adds/removes vertices/edges, i.e.
> >> updates
> >> > the
> >> > > > > whole
> >> > > > > > > > > Graph.
> >> > > > > > > > > > >
> >> > > > > > > > > > > What do you think? I can see 1 & 2 being very easy
> to
> >> > > > > implement,
> >> > > > > > > but
> >> > > > > > > > I
> >> > > > > > > > > > > suspect 3 won't be that easy (but so awesome to have
> >> ^^).
> >> > > > > > > > > > > Would it work the way a Graph is represented now,
> i.e.
> >> > > with 2
> >> > > > > > > > DataSets?
> >> > > > > > > > > > >
> >> > > > > > > > > > > Any comment, idea, pointer would be much
> appreciated!
> >> > Thank
> >> > > > you
> >> > > > > > ^^
> >> > > > > > > > > > >
> >> > > > > > > > > > > Cheers,
> >> > > > > > > > > > > -V.
> >> > > > > > > > > > >
> >> > > > > > > > > > > [1]:
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> >> > > > > > > > > > > [2]:
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>