You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by greghogan <gi...@git.apache.org> on 2016/04/15 22:05:44 UTC

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

GitHub user greghogan opened a pull request:

    https://github.com/apache/flink/pull/1896

    [FLINK-3768] [gelly] Clustering Coefficient

    Provides an algorithm for local clustering coefficient and dependent functions for degree annotation, algorithm caching, and graph translation.
    
    I worked to improve the performance of `TriangleEnumerator`. Perhaps the API has changed since `Edge.reverse()` is not in-place and the edges were not being sorted by degree. The `JoinHint` is also important so that the `Triad`s are not spilled to disk.
    
    On an AWS ec2.4xlarge (16 vcores, 30 GiB) I am seeing for the following timings of 5s, 29s, and 183s for `TriangleListing`. With `TriangleEnumerator` the timings are 7s, 45s, and 281s. Without the `JoinHint` the latter `TriangleEnumerator` timings are 58s and 347s.
    
    Scale | ChecksumHashCode | Count
    ------|----------------------------|----------
    16 | 0x0000d9086985f4ce | 15616010
    18 | 0x0010eeb32a441365 | 82781436
    20 | 0x014a9434bb57ddef | 423780284
    
    The command I had used to run the tests:
    ```
    ./bin/flink run -class org.apache.flink.graph.examples.TriangleListing ~/flink-gelly-examples_2.10-1.1-SNAPSHOT.jar --clip_and_flip false --output print --output hash --scale 16 --listing
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/greghogan/flink 3768_clustering_coefficient

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1896.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1896
    
----
commit aa1141f4d34f7af9c092ec76bf1a81de310aed16
Author: Greg Hogan <co...@greghogan.com>
Date:   2016-04-13T13:28:38Z

    [FLINK-3768] [gelly] Clustering Coefficient
    
    Provides an algorithm for local clustering coefficient and dependent
    functions for degree annotation, algorithm caching, and graph translation.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1896#discussion_r62708062
  
    --- Diff: flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/LocalClusteringCoefficient.java ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.examples;
    +
    +import org.apache.commons.math3.random.JDKRandomGenerator;
    +import org.apache.flink.api.common.JobExecutionResult;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.ExecutionEnvironment;
    +import org.apache.flink.api.java.io.CsvOutputFormat;
    +import org.apache.flink.api.java.utils.DataSetUtils;
    +import org.apache.flink.api.java.utils.ParameterTool;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.asm.translate.LongValueToIntValue;
    +import org.apache.flink.graph.asm.translate.TranslateGraphIds;
    +import org.apache.flink.graph.generator.RMatGraph;
    +import org.apache.flink.graph.generator.random.JDKRandomGeneratorFactory;
    +import org.apache.flink.graph.generator.random.RandomGenerableFactory;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.types.NullValue;
    +
    +import java.text.NumberFormat;
    +
    +public class LocalClusteringCoefficient {
    --- End diff --
    
    Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the pull request:

    https://github.com/apache/flink/pull/1896#issuecomment-219416157
  
    If there are no further review comments I'll look at merging this today.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

Posted by hsaputra <gi...@git.apache.org>.
Github user hsaputra commented on the pull request:

    https://github.com/apache/flink/pull/1896#issuecomment-218210466
  
    Please address my comments about adding Javadoc header info to ALL new Java class added to source repo. Thx!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1896#issuecomment-210778482
  
    Ideally, each PR should make a single change and reference a single JIRA issue. If you could break this into smaller changes, that'd be great. Otherwise, it would really help if you could give a more detailed description of changes and additions. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the pull request:

    https://github.com/apache/flink/pull/1896#issuecomment-218194878
  
    I have rebased this PR against the newly committed dependent features so it should be good for discussion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/1896


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

Posted by hsaputra <gi...@git.apache.org>.
Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1896#discussion_r62704246
  
    --- Diff: flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/LocalClusteringCoefficient.java ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.examples;
    +
    +import org.apache.commons.math3.random.JDKRandomGenerator;
    +import org.apache.flink.api.common.JobExecutionResult;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.ExecutionEnvironment;
    +import org.apache.flink.api.java.io.CsvOutputFormat;
    +import org.apache.flink.api.java.utils.DataSetUtils;
    +import org.apache.flink.api.java.utils.ParameterTool;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.asm.translate.LongValueToIntValue;
    +import org.apache.flink.graph.asm.translate.TranslateGraphIds;
    +import org.apache.flink.graph.generator.RMatGraph;
    +import org.apache.flink.graph.generator.random.JDKRandomGeneratorFactory;
    +import org.apache.flink.graph.generator.random.RandomGenerableFactory;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.types.NullValue;
    +
    +import java.text.NumberFormat;
    +
    +public class LocalClusteringCoefficient {
    --- End diff --
    
    I know this is just an example, but would be nice to add Javadoc information to give high level information why this class exists.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the pull request:

    https://github.com/apache/flink/pull/1896#issuecomment-210709650
  
    It's certainly nicer to review small PRs, but I also like to present the big picture and context in which the features will be used. What if I leave this here, break out the dependencies into new PRs, and then rebase this PR down if those features are accepted? I'm not quite sure what a design document would cover. There's not much sophisticated here as with SG or GSA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1896#issuecomment-219454345
  
    Hi @greghogan,
    I haven't had time to review this PR. If this is blocking you or is urgent for some reason, please go ahead. Otherwise, I could take a look later this week.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

Posted by hsaputra <gi...@git.apache.org>.
Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1896#discussion_r59960582
  
    --- Diff: flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/TriangleListing.java ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.examples;
    +
    +import org.apache.commons.math3.random.JDKRandomGenerator;
    +import org.apache.flink.api.common.JobExecutionResult;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.ExecutionEnvironment;
    +import org.apache.flink.api.java.io.CsvOutputFormat;
    +import org.apache.flink.api.java.utils.DataSetUtils;
    +import org.apache.flink.api.java.utils.ParameterTool;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.generator.RMatGraph;
    +import org.apache.flink.graph.generator.random.JDKRandomGeneratorFactory;
    +import org.apache.flink.graph.generator.random.RandomGenerableFactory;
    +import org.apache.flink.graph.library.TriangleEnumerator;
    +import org.apache.flink.graph.utils.Translate;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.types.NullValue;
    +
    +import java.text.NumberFormat;
    +
    +public class TriangleListing {
    --- End diff --
    
    Could add JavaDoc to ALL new Java or Scala classes added to source? Explanation why and how it is interact with other code is something we are looking for. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1896#issuecomment-210663735
  
    Hi @greghogan,
    thank you for the PR. This is a big addition! Are all the changes related to the clustering coefficient JIRA? Could it maybe be split into smaller PRs in order to make reviewing easier?
    For big changes like this, it usually a good idea to create a design document to show the high-level approach and functionality to be added and discuss it with the community. Do you think it would make sense to do this?
    Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

Posted by hsaputra <gi...@git.apache.org>.
Github user hsaputra commented on the pull request:

    https://github.com/apache/flink/pull/1896#issuecomment-218637517
  
    Thanks for the docs update.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---