You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by Vasiliki Kalavri <va...@gmail.com> on 2016/01/06 23:29:49 UTC

[gelly] partition-centric iterations

Hi squirrels,

here's some more Gelly iteration love for you ;)

Together with Kien and Nikola (students at KTH, cc-ed), we looked into
partition-centric iterations for Gelly. The idea of the partition-centric
model is to expose the graph partition structure to the user (not just a
single vertex, but all vertices belonging to the same partition at once).
It allows for local graph structure exploitation inside a partition and
potentially avoids unnecessary communication. The model was first
introduced in [1] and we used it as inspiration, but our current
implementation is closer to the Spargel model.

Kien and Nikola prepared some slides with their design choices and
experiments [2]. The connected components results are quite impressive :)
There is still room for optimization, but you can find the wip repository
in [3] if you're interested.

What do you think about exposing this as a Gelly iteration model? The task
has been in the Gelly roadmap and there are certainly performance
advantages for some algorithms. This implementation can be basically
considered as a Spargel optimization (existing VertexUpdateFunctions from
Spargel can actually be re-used). On the other hand, the model is quite
low-level compared to Spargel and vertex-centric, since users have to deal
with inside-partition algorithms themselves (e.g. compare the connected
components implementation [4] with the existing Spargel/GSA ones).

Thanks!
-Vasia.

[1]: http://researcher.ibm.com/researcher/files/us-ytian/giraph++.pdf
[2]:
https://drive.google.com/file/d/0BzQJrI2eGlyYRmlKTjMyVEk1T0U/view?usp=sharing
[3]: https://github.com/vasia/gelly-partition-centric
[4]:
https://github.com/vasia/gelly-partition-centric/blob/master/src/main/java/org/apache/flink/graph/partition/centric/library/PCConnectedComponents.java

Re: [gelly] partition-centric iterations

Posted by Truong Duc Kien <du...@gmail.com>.

Hi Martin,

> What I got in mind is something like:
> 1 Compute partitions or a set of subgraphs
>  (via CC, LP, Pattern Matching, ...)
> 2 Partition Vertices by computed Partition/Subgraph ID
> 3 Compute Algorithm X (Page Rank, BC, SSSP, FSM...) per
> Partition/Subgraph via PC iteration

I did try this idea at the start of the project, but was unable to
store the computed partitions efficiently so I abandoned it.
My original approach was storing each partition as an element
of a DataSet, but the partition object can get very large and
cause big serialization/de-serialization overhead.
This approach also breaks if one partition can't be fitted into memory.

Best regards,
Kien


On Thu, Jan 7, 2016 at 5:14 PM, Vasiliki Kalavri <va...@gmail.com>
wrote:

> Hi Martin,
>
> thanks for your input!
>
>
> On 7 January 2016 at 09:36, Martin Junghanns <m....@mailbox.org>
> wrote:
>
> > Hi,
> >
> > this would be a very nice addition! I had a glimpse look into the PC
> > implementation and the two library algorithms and when you get the
> > idea, it is easy to follow what's happening. The benchmark results are
> > also very promising.
> >
> > I got some questions about partitions:
> >
> > 1) I was wondering if the coGroup after the PartitionProcessUDF is aware
> > of the initial partitioning. To be more precise, is it guaranteed, that
> > the (vertex-)partitions do not change during the whole iteration and the
> > RichEdges "stay" on the same worker?
> >
>
>
> I think you can find the questions to your answers in the attached
> generated plan [1].
> The edges are cached, meaning that the partitions remain on the same
> workers across iterations.
> Records from the first coGroup to the mapPartition are forwarded (no
> shuffling).
> The output of the mapPartition operator is a tuple of <vertexID, message>,
> which is sorted and shuffled to reach the destination vertices.
>
>
>
> > 2) If 1) is true, would it work to call partition(udf) on the vertex
> > data set before the PC iteration starts?
> >
>
> Do you mean replacing the default hash partitioning with some custom
> partition function? I guess that'd would work, yes.
>
>
>
> > 3) If 1) is false, would it work to call partition(udf) after the
> > coGroup during the delta iteration to achieve partition stability?
> >
> > What I got in mind is something like:
> >
> > 1 Compute partitions or a set of subgraphs
> >   (via CC, LP, Pattern Matching, ...)
> > 2 Partition Vertices by computed Partition/Subgraph ID
> > 3 Compute Algorithm X (Page Rank, BC, SSSP, FSM...) per
> > Partition/Subgraph via PC iteration
> >
> > Again, nice work!
> >
> > Best,
> > Martin
> >
>
>
>
> Cheers,
> -Vasia.
>
> [1]:
>
> https://drive.google.com/file/d/0BzQJrI2eGlyYSHZJTHFObmgxTXM/view?usp=sharing
>
>
> >
> > On 06.01.2016 23:29, Vasiliki Kalavri wrote:
> > > Hi squirrels,
> > >
> > > here's some more Gelly iteration love for you ;)
> > >
> > > Together with Kien and Nikola (students at KTH, cc-ed), we looked into
> > > partition-centric iterations for Gelly. The idea of the
> partition-centric
> > > model is to expose the graph partition structure to the user (not just
> a
> > > single vertex, but all vertices belonging to the same partition at
> once).
> > > It allows for local graph structure exploitation inside a partition and
> > > potentially avoids unnecessary communication. The model was first
> > > introduced in [1] and we used it as inspiration, but our current
> > > implementation is closer to the Spargel model.
> > >
> > > Kien and Nikola prepared some slides with their design choices and
> > > experiments [2]. The connected components results are quite impressive
> :)
> > > There is still room for optimization, but you can find the wip
> repository
> > > in [3] if you're interested.
> > >
> > > What do you think about exposing this as a Gelly iteration model? The
> > task
> > > has been in the Gelly roadmap and there are certainly performance
> > > advantages for some algorithms. This implementation can be basically
> > > considered as a Spargel optimization (existing VertexUpdateFunctions
> from
> > > Spargel can actually be re-used). On the other hand, the model is quite
> > > low-level compared to Spargel and vertex-centric, since users have to
> > deal
> > > with inside-partition algorithms themselves (e.g. compare the connected
> > > components implementation [4] with the existing Spargel/GSA ones).
> > >
> > > Thanks!
> > > -Vasia.
> > >
> > > [1]: http://researcher.ibm.com/researcher/files/us-ytian/giraph++.pdf
> > > [2]:
> > >
> >
> https://drive.google.com/file/d/0BzQJrI2eGlyYRmlKTjMyVEk1T0U/view?usp=sharing
> > > [3]: https://github.com/vasia/gelly-partition-centric
> > > [4]:
> > >
> >
> https://github.com/vasia/gelly-partition-centric/blob/master/src/main/java/org/apache/flink/graph/partition/centric/library/PCConnectedComponents.java
> > >
> >
>

Re: [gelly] partition-centric iterations

Posted by Vasiliki Kalavri <va...@gmail.com>.

Hi Martin,

thanks for your input!


On 7 January 2016 at 09:36, Martin Junghanns <m....@mailbox.org>
wrote:

> Hi,
>
> this would be a very nice addition! I had a glimpse look into the PC
> implementation and the two library algorithms and when you get the
> idea, it is easy to follow what's happening. The benchmark results are
> also very promising.
>
> I got some questions about partitions:
>
> 1) I was wondering if the coGroup after the PartitionProcessUDF is aware
> of the initial partitioning. To be more precise, is it guaranteed, that
> the (vertex-)partitions do not change during the whole iteration and the
> RichEdges "stay" on the same worker?
>


I think you can find the questions to your answers in the attached
generated plan [1].
The edges are cached, meaning that the partitions remain on the same
workers across iterations.
Records from the first coGroup to the mapPartition are forwarded (no
shuffling).
The output of the mapPartition operator is a tuple of <vertexID, message>,
which is sorted and shuffled to reach the destination vertices.



> 2) If 1) is true, would it work to call partition(udf) on the vertex
> data set before the PC iteration starts?
>

Do you mean replacing the default hash partitioning with some custom
partition function? I guess that'd would work, yes.



> 3) If 1) is false, would it work to call partition(udf) after the
> coGroup during the delta iteration to achieve partition stability?
>
> What I got in mind is something like:
>
> 1 Compute partitions or a set of subgraphs
>   (via CC, LP, Pattern Matching, ...)
> 2 Partition Vertices by computed Partition/Subgraph ID
> 3 Compute Algorithm X (Page Rank, BC, SSSP, FSM...) per
> Partition/Subgraph via PC iteration
>
> Again, nice work!
>
> Best,
> Martin
>



Cheers,
-Vasia.

[1]:
https://drive.google.com/file/d/0BzQJrI2eGlyYSHZJTHFObmgxTXM/view?usp=sharing


>
> On 06.01.2016 23:29, Vasiliki Kalavri wrote:
> > Hi squirrels,
> >
> > here's some more Gelly iteration love for you ;)
> >
> > Together with Kien and Nikola (students at KTH, cc-ed), we looked into
> > partition-centric iterations for Gelly. The idea of the partition-centric
> > model is to expose the graph partition structure to the user (not just a
> > single vertex, but all vertices belonging to the same partition at once).
> > It allows for local graph structure exploitation inside a partition and
> > potentially avoids unnecessary communication. The model was first
> > introduced in [1] and we used it as inspiration, but our current
> > implementation is closer to the Spargel model.
> >
> > Kien and Nikola prepared some slides with their design choices and
> > experiments [2]. The connected components results are quite impressive :)
> > There is still room for optimization, but you can find the wip repository
> > in [3] if you're interested.
> >
> > What do you think about exposing this as a Gelly iteration model? The
> task
> > has been in the Gelly roadmap and there are certainly performance
> > advantages for some algorithms. This implementation can be basically
> > considered as a Spargel optimization (existing VertexUpdateFunctions from
> > Spargel can actually be re-used). On the other hand, the model is quite
> > low-level compared to Spargel and vertex-centric, since users have to
> deal
> > with inside-partition algorithms themselves (e.g. compare the connected
> > components implementation [4] with the existing Spargel/GSA ones).
> >
> > Thanks!
> > -Vasia.
> >
> > [1]: http://researcher.ibm.com/researcher/files/us-ytian/giraph++.pdf
> > [2]:
> >
> https://drive.google.com/file/d/0BzQJrI2eGlyYRmlKTjMyVEk1T0U/view?usp=sharing
> > [3]: https://github.com/vasia/gelly-partition-centric
> > [4]:
> >
> https://github.com/vasia/gelly-partition-centric/blob/master/src/main/java/org/apache/flink/graph/partition/centric/library/PCConnectedComponents.java
> >
>

Re: [gelly] partition-centric iterations

Posted by Martin Junghanns <m....@mailbox.org>.

Hi,

this would be a very nice addition! I had a glimpse look into the PC
implementation and the two library algorithms and when you get the
idea, it is easy to follow what's happening. The benchmark results are
also very promising.

I got some questions about partitions:

1) I was wondering if the coGroup after the PartitionProcessUDF is aware
of the initial partitioning. To be more precise, is it guaranteed, that
the (vertex-)partitions do not change during the whole iteration and the
RichEdges "stay" on the same worker?
2) If 1) is true, would it work to call partition(udf) on the vertex
data set before the PC iteration starts?
3) If 1) is false, would it work to call partition(udf) after the
coGroup during the delta iteration to achieve partition stability?

What I got in mind is something like:

1 Compute partitions or a set of subgraphs
  (via CC, LP, Pattern Matching, ...)
2 Partition Vertices by computed Partition/Subgraph ID
3 Compute Algorithm X (Page Rank, BC, SSSP, FSM...) per
Partition/Subgraph via PC iteration

Again, nice work!

Best,
Martin

On 06.01.2016 23:29, Vasiliki Kalavri wrote:
> Hi squirrels,
> 
> here's some more Gelly iteration love for you ;)
> 
> Together with Kien and Nikola (students at KTH, cc-ed), we looked into
> partition-centric iterations for Gelly. The idea of the partition-centric
> model is to expose the graph partition structure to the user (not just a
> single vertex, but all vertices belonging to the same partition at once).
> It allows for local graph structure exploitation inside a partition and
> potentially avoids unnecessary communication. The model was first
> introduced in [1] and we used it as inspiration, but our current
> implementation is closer to the Spargel model.
> 
> Kien and Nikola prepared some slides with their design choices and
> experiments [2]. The connected components results are quite impressive :)
> There is still room for optimization, but you can find the wip repository
> in [3] if you're interested.
> 
> What do you think about exposing this as a Gelly iteration model? The task
> has been in the Gelly roadmap and there are certainly performance
> advantages for some algorithms. This implementation can be basically
> considered as a Spargel optimization (existing VertexUpdateFunctions from
> Spargel can actually be re-used). On the other hand, the model is quite
> low-level compared to Spargel and vertex-centric, since users have to deal
> with inside-partition algorithms themselves (e.g. compare the connected
> components implementation [4] with the existing Spargel/GSA ones).
> 
> Thanks!
> -Vasia.
> 
> [1]: http://researcher.ibm.com/researcher/files/us-ytian/giraph++.pdf
> [2]:
> https://drive.google.com/file/d/0BzQJrI2eGlyYRmlKTjMyVEk1T0U/view?usp=sharing
> [3]: https://github.com/vasia/gelly-partition-centric
> [4]:
> https://github.com/vasia/gelly-partition-centric/blob/master/src/main/java/org/apache/flink/graph/partition/centric/library/PCConnectedComponents.java
>