You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by andralungu <gi...@git.apache.org> on 2015/08/25 14:14:05 UTC

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

GitHub user andralungu opened a pull request:

    https://github.com/apache/flink/pull/1054

    [FLINK-2570] [gelly] Added a Triangle Count Library Method

    This PR adds a Triangle Count Library Method to Gelly. 
    
    I decided to add the GAS version because it is faster than vertex-centric for most graph data sets.
    
    This implementation has been extensively tested on a 30-node, 16GB RAM/machine, cluster.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andralungu/flink triangleCount

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1054.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1054
    
----
commit 99ef9720f5c70ab2029b21017d6b9d3100c827e8
Author: Andra Lungu <lu...@gmail.com>
Date:   2015-08-25T12:07:18Z

    [FLINK-2570] [gelly] Added a Triangle Count Library Method

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1054#discussion_r37869349
  
    --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/library/GSATriangleCount.java ---
    @@ -0,0 +1,187 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library;
    +
    +
    +import org.apache.flink.api.common.functions.MapFunction;
    +import org.apache.flink.api.common.functions.ReduceFunction;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.ExecutionEnvironment;
    +import org.apache.flink.api.java.tuple.Tuple1;
    +import org.apache.flink.api.java.tuple.Tuple2;
    +import org.apache.flink.graph.GraphAlgorithm;
    +import org.apache.flink.graph.ReduceNeighborsFunction;
    +import org.apache.flink.graph.Vertex;
    +import org.apache.flink.graph.Edge;
    +import org.apache.flink.graph.Triplet;
    +import org.apache.flink.graph.EdgeDirection;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.types.NullValue;
    +
    +import java.util.TreeMap;
    +
    +/**
    + * Triangle Count Algorithm.
    + *
    + * This algorithm operates in three phases. First, vertices select neighbors with id greater than theirs
    + * and send messages to them. Each received message is then propagated to neighbors with higher id.
    + * Finally, if a node encounters the target id in the list of received messages, it increments the number
    + * of triangles found.
    + *
    + * This implementation is non - iterative. The total number of triangles can be determined by performing
    + * a single pass through the graph.
    --- End diff --
    
    The implementation is not iterative indeed, but I wouldn't call it "single-pass". Single-pass means that each edge/vertex of the graph is only read and processed once. That would be the case, e.g. in a streaming implementation. I would simply remove the 2nd sentence here :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/1054#issuecomment-134570234
  
    Hi @andralungu. Could you please add a slightly more detailed description of your implementation. This would also be helpful in the JIRA description.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1054#issuecomment-134847237
  
    I think you should wait a bit. There is an open discussion in the mailing list on whether to fork 0.9.1 out of the current master. If there is consensus and you merge this, then this will be one more change to revert (even though this is not API breaking, but it's new functionality, so I'm not really sure).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by andralungu <gi...@git.apache.org>.
Github user andralungu commented on the pull request:

    https://github.com/apache/flink/pull/1054#issuecomment-134845522
  
    Great! I'll merge this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/1054


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1054#discussion_r37869769
  
    --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/library/GSATriangleCount.java ---
    @@ -0,0 +1,187 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library;
    +
    +
    +import org.apache.flink.api.common.functions.MapFunction;
    +import org.apache.flink.api.common.functions.ReduceFunction;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.ExecutionEnvironment;
    +import org.apache.flink.api.java.tuple.Tuple1;
    +import org.apache.flink.api.java.tuple.Tuple2;
    +import org.apache.flink.graph.GraphAlgorithm;
    +import org.apache.flink.graph.ReduceNeighborsFunction;
    +import org.apache.flink.graph.Vertex;
    +import org.apache.flink.graph.Edge;
    +import org.apache.flink.graph.Triplet;
    +import org.apache.flink.graph.EdgeDirection;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.types.NullValue;
    +
    +import java.util.TreeMap;
    +
    +/**
    + * Triangle Count Algorithm.
    + *
    + * This algorithm operates in three phases. First, vertices select neighbors with id greater than theirs
    + * and send messages to them. Each received message is then propagated to neighbors with higher id.
    + * Finally, if a node encounters the target id in the list of received messages, it increments the number
    + * of triangles found.
    + *
    + * This implementation is non - iterative. The total number of triangles can be determined by performing
    + * a single pass through the graph.
    + */
    +public class GSATriangleCount implements
    +		GraphAlgorithm<Long, NullValue, NullValue, DataSet<Tuple1<Integer>>> {
    +
    +	@Override
    +	public DataSet<Tuple1<Integer>> run(Graph<Long, NullValue, NullValue> input) throws Exception {
    +
    +		ExecutionEnvironment env = input.getContext();
    +
    +		// order the edges so that src is always higher than trg
    +		DataSet<Edge<Long, NullValue>> edges = input.getEdges()
    +				.map(new OrderEdges()).distinct();
    --- End diff --
    
    this call to `distinct()` here means that basically if you have 2 edges a->b and b->a in the input, and they are both part of a triangle, then you only count it once?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1054#discussion_r37867925
  
    --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/example/utils/TriangleCountData.java ---
    @@ -0,0 +1,56 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.example.utils;
    +
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.ExecutionEnvironment;
    +import org.apache.flink.graph.Edge;
    +import org.apache.flink.types.NullValue;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +/**
    + * Provides the default data sets used for the Triangle Count example.
    --- End diff --
    
    There is no example :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by andralungu <gi...@git.apache.org>.
Github user andralungu commented on the pull request:

    https://github.com/apache/flink/pull/1054#issuecomment-134702560
  
    Hi @vasia, 
    
    I clarified the type of input expected. The graph should be undirected. Without the distinct, you get duplicate edges there(and an erroneous number of triangles). The second bullet point is again not an issue because the graph is undirected. 
    The result should be fine. For the SNAP data sets, I got a number equal to theirs on a cluster.
    
    Concerning the runtime, you are right, It's just true for some cases (generally faster by a factor of two) but it highly depends on the data set. So, once this gets merged, I'll go ahead and propose the vertex centric version as well. That way, the user can choose. 
    
    Hope I clarified everything!
    Let me know if you still have questions :)
     


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by andralungu <gi...@git.apache.org>.
Github user andralungu commented on the pull request:

    https://github.com/apache/flink/pull/1054#issuecomment-134847386
  
    okay...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by andralungu <gi...@git.apache.org>.
Github user andralungu commented on the pull request:

    https://github.com/apache/flink/pull/1054#issuecomment-134590137
  
    You're right @tillrohrmann. I updated the JIRA to also contain a small description :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1054#issuecomment-134720344
  
    Thanks for the clarifications Andra! Everything looks good to me now.
    
    I would love to also see a PR of the other implementation. And if you have any insight on which one performs better when, then we should add this as a tip to the descriptions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1054#issuecomment-134604410
  
    Thanks for adding this one @andralungu. We really need this algorithm in the library!
    I have a few comments:
    - we should clarify what is the expected input graph format and the output. If I'm not mistaken, it seems you're expecting a directed graph without edge duplicates and you count triangles ignoring edge direction. Is that correct? I would add a clear comment in the usage description about that. If we want to do this even better, we could even add a graph validator for the input.
    - I'm not quite sure what happens when the graph has opposite direction edges, i.e. a->b and b->a, that are both part of a triangle. I would expect that this triangle would be counted twice, but it seems to me that you're only counting it once. Is there a reason for that?
    - as you've been experimenting with this for a while, could you let us know how better is this than your vertex-centric version? Is it always the case? If not, do you think it would make sense to add both implementations in the library and let the users choose?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2570] [gelly] Added a Triangle Count Li...

Posted by andralungu <gi...@git.apache.org>.
Github user andralungu commented on the pull request:

    https://github.com/apache/flink/pull/1054#issuecomment-138497145
  
    Since the bugfix version was released days ago, I guess this is safe to merge...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---