You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/09/12 22:09:45 UTC
[jira] [Commented] (FLINK-2661) Add a Node Splitting Technique to Overcome the Limitations of Skewed Graphs

    [ https://issues.apache.org/jira/browse/FLINK-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742216#comment-14742216 ] 

ASF GitHub Bot commented on FLINK-2661:
---------------------------------------

GitHub user andralungu opened a pull request:

    https://github.com/apache/flink/pull/1124

    [FLINK-2661] Added Node Splitting Methods

    Social media graphs, citation networks or even protein networks have a common property: their degree distribution follows a power-law curve. This structure raises challenges to both vertex-centric and GSA/GAS models because they uniformly process vertices, regardless of their degree distribution. This leads to large execution time stalls: vertices wait for skewed nodes to finish computation [synchronous]. 
    
    This PR aims to diminish the impact of high-degree nodes by proposing four main functions: `determinieSkewedVertices`, `treeDeAggregate` (splits a node into subnodes, recursively, in levels), `propagateValuesToSplitVertices` (useful when the algorithm performs more than one superstep), `treeAggregate` (brings the graph back to its initial state). 
    
    These functions modify a graph at a high-level, making its degree distribution more uniform. The method does not need any special partitioning or runtime modification and (for skewed networks and computationally intensive algorithms) can speed up the run time by a factor of two.     
    
    I added an example: NodeSplittingJaccardSimilarityMeasure, for which I needed to split the overall sequence of operations to two functions to be able to test the result. Calling the entire main method would have resulted in the "Two few memory segments etc" exception - too many operations called within one test, in other words. 
    
    For more info, please consult the additional entry in the documentation. 
    
    If we reach a common point and this PR gets merged, I will also follow @fhueske 's suggestion from the mailing list - adding a Split version of each of the library methods to allow users to verify whether their run time improves. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andralungu/flink splitJaccardFlink

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1124
    
----
commit b02d0917edcd5f3c8846fe01044afd7444a58c08
Author: Andra Lungu <lu...@gmail.com>
Date:   2015-09-12T10:25:20Z

    [FLINK-2661] Added Node Splitting Methods
    
    [FLINK-2661] Minor modifications in the docs
    
    [FLINK-2661] pdf to png

----


> Add a Node Splitting Technique to Overcome the Limitations of Skewed Graphs
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-2661
>                 URL: https://issues.apache.org/jira/browse/FLINK-2661
>             Project: Flink
>          Issue Type: Task
>          Components: Gelly
>    Affects Versions: 0.10
>            Reporter: Andra Lungu
>            Assignee: Andra Lungu
>
> Skewed graphs raise unique challenges to computation models such as Gelly's vertex-centric or GSA iterations. This is mainly because of the fact that these approaches uniformly process vertices regardless of their degree distribution. 
> In vertex-centric, for instance, a skewed node will take more time to process its neighbors compared to the other nodes in the graph. The first will act as a straggler causing the latter to remain idle until it finishes its computation. 
> This issue can be mitigated by splitting a high-degree node into subnodes and evenly distributing the edges to the the resulted subvertices. The computation will then be performed on the split vertex. 
> To this end, we should add a Splitting API on top of Gelly which can help:
> - determine skewed nodes 
> - split them
> - merge them back at the end of the computation, given a user defined combiner.
> To illustrate the usage of these methods, we should add an example as well as a separate entry in the documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)