You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by andralungu <gi...@git.apache.org> on 2015/09/12 22:09:02 UTC

[GitHub] flink pull request: [FLINK-2661] Added Node Splitting Methods

GitHub user andralungu opened a pull request:

    https://github.com/apache/flink/pull/1124

    [FLINK-2661] Added Node Splitting Methods

    Social media graphs, citation networks or even protein networks have a common property: their degree distribution follows a power-law curve. This structure raises challenges to both vertex-centric and GSA/GAS models because they uniformly process vertices, regardless of their degree distribution. This leads to large execution time stalls: vertices wait for skewed nodes to finish computation [synchronous]. 
    
    This PR aims to diminish the impact of high-degree nodes by proposing four main functions: `determinieSkewedVertices`, `treeDeAggregate` (splits a node into subnodes, recursively, in levels), `propagateValuesToSplitVertices` (useful when the algorithm performs more than one superstep), `treeAggregate` (brings the graph back to its initial state). 
    
    These functions modify a graph at a high-level, making its degree distribution more uniform. The method does not need any special partitioning or runtime modification and (for skewed networks and computationally intensive algorithms) can speed up the run time by a factor of two.     
    
    I added an example: NodeSplittingJaccardSimilarityMeasure, for which I needed to split the overall sequence of operations to two functions to be able to test the result. Calling the entire main method would have resulted in the "Two few memory segments etc" exception - too many operations called within one test, in other words. 
    
    For more info, please consult the additional entry in the documentation. 
    
    If we reach a common point and this PR gets merged, I will also follow @fhueske 's suggestion from the mailing list - adding a Split version of each of the library methods to allow users to verify whether their run time improves. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andralungu/flink splitJaccardFlink

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1124
    
----
commit b02d0917edcd5f3c8846fe01044afd7444a58c08
Author: Andra Lungu <lu...@gmail.com>
Date:   2015-09-12T10:25:20Z

    [FLINK-2661] Added Node Splitting Methods
    
    [FLINK-2661] Minor modifications in the docs
    
    [FLINK-2661] pdf to png

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2661] Added Node Splitting Methods

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1124#issuecomment-139959564
  
    Hi @andralungu,
    
    I was under the impression that we never really reached consensus in the mailing list thread regarding this addition.
    
    I definitely think we need to handle skewed graphs and you've done great work, but I wouldn't add this to Gelly at its current state.
    
    - This is a very recent method that has not been tested thoroughly (apart from the experiments in your thesis work). Its benefits and overheads are not yet well understood.
    - The API certainly needs rethinking. This should be a transparent method that would be easy to activate with a flag/option. Right now, it seems to me that it's too complicated to use and can easily allow erroneous implementations.
    
    For now I would suggest that we keep this in your personal repository and we link to it from the Gelly documentation as additional/experimental feature. After we have better understanding of the technique and we have thought of a nicer API, we can reconsider adding it. What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2661] Added Node Splitting Methods

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1124#issuecomment-140146467
  
    Sure, I'm not saying you shouldn't have opened this PR. I'm trying to save you some time.
    
    So, my opinion is that in order for this to be useful, we need 2 things:
    - Understand the performance implications of the method. When is it beneficial? What's the memory overhead? What's the pre-processing overhead? How does it depend on the input? What if I give a wrong threshold? etc.
    - Make this as automatic as possible. Ideally, the user would only have to give the combiner function and the threshold. Splitting and merge should be handled internally. Actually, we could probably find ways to automatically determine the threshold, too, based on the graph, the current load and the available memory.
    
    These are not easy tasks and will need a lot of work. That's why I'm proposing to link to the current state of this method from the Gelly docs, so that we can get feedback. We could create something like a "research projects on Gelly" page, where we link to work in progress from our roadmap tasks.
    
    That's just my view, but let's see what other people think :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2661] Added Node Splitting Methods

Posted by andralungu <gi...@git.apache.org>.
Github user andralungu commented on the pull request:

    https://github.com/apache/flink/pull/1124#issuecomment-140131099
  
    Hi @vasia ,
    
    There was no clear rejection in the discussion, so I thought I'd try my luck :)
    
    This technique could be enabled by using a flag only if we introduce a node split version of all the library methods. Otherwise, for newly written code, users would still have to pass a combine function according to their algorithm. They'd also have to pass the threshold that differentiates high-degree vertices from low-degree vertices. What is more, they still need to split and aggregate before and after each step.
    
    These are aspects that cannot be guessed and that highly depend on the algorithm. 
    
    Also, what I had in mind regarding this PR was that we would discuss and  try to improve the code a bit, rather than immediately discarding it :(


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2661] Added Node Splitting Methods

Posted by andralungu <gi...@git.apache.org>.
Github user andralungu closed the pull request at:

    https://github.com/apache/flink/pull/1124


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---