You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by "Martin Junghanns (JIRA)" <ji...@apache.org> on 2015/07/27 17:43:04 UTC
[jira] [Created] (FLINK-2411) Add basic graph summarization
algorithm
Martin Junghanns created FLINK-2411:
---------------------------------------
Summary: Add basic graph summarization algorithm
Key: FLINK-2411
URL: https://issues.apache.org/jira/browse/FLINK-2411
Project: Flink
Issue Type: New Feature
Components: Gelly
Affects Versions: master
Reporter: Martin Junghanns
Priority: Minor
Graph summarization determines a structural grouping of similar vertices and edges to condense a graph and thus helps to uncover insights about patterns hidden in the graph. It can be used in OLAP-style operations on the graph and is similar to group by in SQL but on the graph structure instead of rows.
The graph summarization operator represents every vertex group by a single vertex in the summarized graph; edges between vertices in the summary graph represent a group of edges between the vertex group members of the original graph. Summarization is defined by specifying grouping keys for vertices and edges, respectively.
One publication that presents a Map/Reduce based approach is
"Pagrol: Parallel graph olap over large-scale attributed graphs", however they pre-compute the graph-cube before it can be analyzed. With Flink, we can give the user an interactive way of summarizing the graph and do not need to compute the cube beforehand.
A more complex approach focuses on summarization on graph patterns
"SynopSys: Large Graph Analytics in the SAP HANA Database Through Summarization".
However, I want to start with a simple algorithm that summarizes the graph on vertex and optionally edge values and additionally stores a count aggregate at summarized vertices/edges.
Consider the following two examples
(e.g., social network with users from cities and friendships with timestamp):
h4. Input graph:
Vertices (id, value):
(0, Leipzig)
(1, Leipzig)
(2, Dresden)
(3, Dresden)
(4, Dresden)
(5, Berlin)
Edges (source, target, value):
(0, 1, 2014)
(1, 0, 2014)
(1, 2, 2013)
(2, 1, 2013)
(2, 3, 2014)
(3, 2, 2014)
(4, 0, 2013)
(4, 1, 2015)
(5, 2, 2015)
(5, 3, 2015)
h4. Output graph (summarized on vertex value):
Vertices (id, value, count)
(0, Leipzig, 2) // "2 users from Leipzig"
(2, Dresden, 3) // "3 users from Dresden"
(5, Berlin, 1) // "1 user from Berlin"
Edges (source, target, count)
(0, 0, 2) // "2 edges between users in Leipzig"
(0, 2, 1) // "1 edge from users in Leipzig to users in Dresden"
(2, 0, 3) // "3 edges from users in Dresden to users in Leipzig"
(2, 2, 2) // "2 edges between users in Dresden"
(5, 2, 2) // "2 edges from users in Berlin to users in Dresden"
h4. Output graph (summarized on vertex and edge value):
Vertices (id, value, count)
(0, Leipzig, 2)
(2, Dresden, 3)
(5, Berlin, 1)
Edges (source, target, value, count)
(0, 0, 2014, 2) // ...
(0, 2, 2013, 1) // ...
(2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with timestamp 2013"
(2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with timestamp 2015"
(2, 2, 2014, 2) // ...
(5, 2, 2015, 2) // ...
I've already implemented two versions of the summarization algorithm in our own project https://github.com/dbs-leipzig/gradoop,
which is a graph analytics stack on top of Hadoop + Gelly/Flink with a fixed data model. You can see the current WIP here:
1 [Abstract summarization|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java]
2 [Implementation using cross|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java]
3 [Implementation using joins|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java]
4 [Tests|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/test/java/org/gradoop/model]/impl/EPGraphSummarizeTest.java
5 [TestGraph|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/dev-support/social-network.pdf]
I would basically use the same implementation as in 3 in combination with KeySelectors to select the grouping keys on vertices and edges.
As you can see in the example, each vertex in the resulting graph has a vertex id that is contained in the original graph. This id is the smallest id among the grouped vertices (e.g., vertices 2, 3 and 4 represent Dresden, so 2 is the group representative). The latter is necessary to correctly assign the summarized edges. Maybe there is a smarter way
to do it of which I did not think yet.
I would like contribute this to Flink and of course, if you have any suggestions/improvements or do not want this at all (hopefully not), please let me know.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)