You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by mfahimazizi <gi...@git.apache.org> on 2015/06/03 01:29:39 UTC

[GitHub] flink pull request: Hits

GitHub user mfahimazizi opened a pull request:

    https://github.com/apache/flink/pull/765

    Hits

    HITS algorithm in Gelly.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mfahimazizi/flink HITS

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/765.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #765
    
----
commit e3fe353a5a4fbca35a65672214f451cda58de6ab
Author: mfahimazizi <mf...@gmail.com>
Date:   2015-05-30T21:53:30Z

    HITS algorithm added

commit 646660272c4c9b4152d2ebfc4bfbd6e04d891408
Author: mfahimazizi <mf...@gmail.com>
Date:   2015-06-02T23:23:15Z

    HITS algorithm added_

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/765#issuecomment-118161158
  
    Hi there!
    What is the status of this PR?
    It seems that travis is failing with checkstyle violations related to using spaces instead of tabs.
    Let us know if you need further help!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by andralungu <gi...@git.apache.org>.
Github user andralungu commented on a diff in the pull request:

    https://github.com/apache/flink/pull/765#discussion_r31677964
  
    --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/library/HITS.java ---
    @@ -0,0 +1,180 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library;
    +import org.apache.flink.api.common.aggregators.LongSumAggregator;
    +import org.apache.flink.api.common.functions.MapFunction;
    +import org.apache.flink.graph.Edge;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.GraphAlgorithm;
    +import org.apache.flink.graph.Vertex;
    +import org.apache.flink.graph.spargel.MessageIterator;
    +import org.apache.flink.graph.spargel.MessagingFunction;
    +import org.apache.flink.graph.spargel.VertexUpdateFunction;
    +import org.apache.flink.graph.utils.Hits;
    +
    +import java.io.Serializable;
    +
    +
    +/**
    + *
    + * This class implements the HITS algorithm by using flink Gelly API
    + *Hyperlink-Induced Topic Search (HITS; also known as hubs and authorities) is a link analysis algorithm that rates Web pages,
    + *developed by Jon Kleinberg.
    + *
    + * The algorithm performs a series of iterations, each consisting of two basic steps:
    + *
    + * Authority Update: Update each node's Authority score to be equal to the sum of the Hub Scores of each node that
    + * points to it.
    + * That is, a node is given a high authority score by being linked from pages that are recognized as Hubs for information.
    + * Hub Update: Update each node's Hub Score to be equal to the sum of the Authority Scores of each node that it
    + * points to.
    + * That is, a node is given a high hub score by linking to nodes that are considered to be authorities on the subject.
    + *
    + * The Hub score and Authority score for a node is calculated with the following algorithm:
    + *  *Start with each node having a hub score and authority score of 1.
    + *  *Run the Authority Update Rule
    + *  *Run the Hub Update Rule
    + *  *Normalize the values by dividing each Hub score by square root of the sum of the squares of all Hub scores, and
    + *   dividing each Authority score by square root of the sum of the squares of all Authority scores.
    + *  *Repeat from the second step as necessary.
    + *
    + * http://en.wikipedia.org/wiki/HITS_algorithm
    --- End diff --
    
    I would add an @see annotation for the link here, check the community detection library method. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by andralungu <gi...@git.apache.org>.
Github user andralungu commented on a diff in the pull request:

    https://github.com/apache/flink/pull/765#discussion_r31678355
  
    --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/example/HITSExample.java ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.example;
    +
    +import org.apache.flink.api.common.ProgramDescription;
    +import org.apache.flink.api.common.functions.FlatMapFunction;
    +import org.apache.flink.api.common.functions.MapFunction;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.ExecutionEnvironment;
    +import org.apache.flink.graph.Edge;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.Vertex;
    +import org.apache.flink.graph.library.HITS;
    +import org.apache.flink.graph.utils.Hits;
    +import org.apache.flink.graph.utils.Tuple3ToEdgeMap;
    +import org.apache.flink.util.Collector;
    +
    +/**
    + * This program is an example for HITS algorithm.
    + * the result is either a hub value or authority value base user selection.
    + *
    + * If no arguments are provided, the example runs with a random graph of 10 vertices
    + * and random edge weights.
    + */
    +
    +public class HITSExample implements ProgramDescription {
    +    @SuppressWarnings("serial")
    +    public static void main(String[] args) throws Exception {
    +
    +        if(!parseParameters(args)) {
    +            return;
    +        }
    +
    +        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    +        DataSet<Edge<Long, String>> links = getEdgesDataSet(env);
    +
    +        Graph<Long, Double, String> network = Graph.fromDataSet(links, new MapFunction<Long, Double>() {
    +
    +            public Double map(Long value) throws Exception {
    +                return 1.0;
    +            }
    +        }, env);
    +
    +    // add  graph to HITS class with iteration value and hub or authority enum value.
    +        DataSet<Vertex<Long, Double>> HitsValue =network.run(
    +                    new HITS<Long>(Hits.AUTHORITY,maxIterations)).getVertices();
    +
    +        if (fileOutput) {
    +            HitsValue.writeAsCsv(outputPath, "\n", "\t");
    +        } else {
    +            HitsValue.print();
    +        }
    +//        env.execute("HITS algorithm");
    --- End diff --
    
    The env.execute() caused problems for you because you were only testing print(). You don't need to do an env.execute() on the else branch as print() now has the same status as count() and collect(), but this command is still needed on the then branch(i.e. after writeAsCsv). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by hsaputra <gi...@git.apache.org>.
Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/flink/pull/765#discussion_r31596519
  
    --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/library/HITS.java ---
    @@ -0,0 +1,163 @@
    +package org.apache.flink.graph.library;
    --- End diff --
    
    Missing Apache license header


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/765#issuecomment-119744587
  
    Hi @mfahimazizi!
    Thank you for the update. 
    
    It seems Travis is still complaining about checkstyle. Could you check again please? You can see these errors locally as well, if you run `mvn verify` inside `flink-gelly`.
    
    You are right that inside the `VertexUpdateFunction` you cannot access the edge values. However, you could add the edge value inside the message that you create in the `MessagingFunction` (you can access the edges there with `getEdges()`.
    Would that work for you? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by mfahimazizi <gi...@git.apache.org>.
Github user mfahimazizi commented on the pull request:

    https://github.com/apache/flink/pull/765#issuecomment-140280883
  
    @andralungu and @vasia , we are outside germany :(
    we need to access edges inside vertexupdate function (not inside messaging function), as we mentioned in above comments,



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/765#issuecomment-149148599
  
    Hi @mfahimazizi,
    is there any update regarding this PR? If you're not planning to keep working on it, could you please close this? Thanks a lot!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/765#issuecomment-140903149
  
    Thanks for the explanation @mfahimazizi.
    So, if you put the edge label inside the message that you send in the messaging function, then you will know which values to add during odd and even supersteps. All you have to do is create a message that contains both the node value and the edge label.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by mfahimazizi <gi...@git.apache.org>.
Github user mfahimazizi commented on the pull request:

    https://github.com/apache/flink/pull/765#issuecomment-119188834
  
    Hi @vasia,
    sorry for late reply.
    I set indentation with tabs instead of spaces in IDE.
    and push new updated files.
    as I discuss in jira issue about vertexupdate function, where we couldn't access edges inside this function. if we overload this function with following, then we can identify Hub and Authority edges in undirected graph.
    ```public void updateVertex(Vertex<K, Double> vertex, MessageIterator<Edge<Long,String>,Double>> inMessages)```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by mfahimazizi <gi...@git.apache.org>.
Github user mfahimazizi commented on the pull request:

    https://github.com/apache/flink/pull/765#issuecomment-140633835
  
    Hi @vasia, Thank you.
    for this algorithm, we need to calculate edges values inside  VertexUpdateFunction, so I will try to build my own delta iteration. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by mfahimazizi <gi...@git.apache.org>.
Github user mfahimazizi commented on the pull request:

    https://github.com/apache/flink/pull/765#issuecomment-140871748
  
    Hey @vasia, Thank you.
    sorry it is "calculate node values". 
    first we label the edge as 'H' (hub score), then undirected the graph and label the new edges as 'A' (authority score). in VertexUpdateFunction, in odd iteration we first find edges which has 'H' label and sum its nodes value and even iteration we find edges which has 'A' label and sum its nodes value. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by andralungu <gi...@git.apache.org>.
Github user andralungu commented on a diff in the pull request:

    https://github.com/apache/flink/pull/765#discussion_r31678387
  
    --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/example/HITSExample.java ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.example;
    +
    +import org.apache.flink.api.common.ProgramDescription;
    +import org.apache.flink.api.common.functions.FlatMapFunction;
    +import org.apache.flink.api.common.functions.MapFunction;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.ExecutionEnvironment;
    +import org.apache.flink.graph.Edge;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.Vertex;
    +import org.apache.flink.graph.library.HITS;
    +import org.apache.flink.graph.utils.Hits;
    +import org.apache.flink.graph.utils.Tuple3ToEdgeMap;
    +import org.apache.flink.util.Collector;
    +
    +/**
    + * This program is an example for HITS algorithm.
    + * the result is either a hub value or authority value base user selection.
    + *
    + * If no arguments are provided, the example runs with a random graph of 10 vertices
    + * and random edge weights.
    + */
    +
    +public class HITSExample implements ProgramDescription {
    +    @SuppressWarnings("serial")
    +    public static void main(String[] args) throws Exception {
    +
    +        if(!parseParameters(args)) {
    +            return;
    +        }
    +
    +        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    +        DataSet<Edge<Long, String>> links = getEdgesDataSet(env);
    +
    +        Graph<Long, Double, String> network = Graph.fromDataSet(links, new MapFunction<Long, Double>() {
    +
    +            public Double map(Long value) throws Exception {
    +                return 1.0;
    +            }
    +        }, env);
    +
    +    // add  graph to HITS class with iteration value and hub or authority enum value.
    +        DataSet<Vertex<Long, Double>> HitsValue =network.run(
    +                    new HITS<Long>(Hits.AUTHORITY,maxIterations)).getVertices();
    +
    +        if (fileOutput) {
    +            HitsValue.writeAsCsv(outputPath, "\n", "\t");
    +        } else {
    +            HitsValue.print();
    +        }
    +//        env.execute("HITS algorithm");
    +    }
    +
    +    @Override
    +    public String getDescription() {
    +        return "HITS algorithm example";
    +    }
    +
    +    // *************************************************************************
    +    //     UTIL METHODS
    +    // *************************************************************************
    +
    +    private static boolean fileOutput = false;
    +    private static long numPages = 10;
    +    private static String edgeInputPath = null;
    +    private static String outputPath = null;
    +    private static int maxIterations = 5;
    +
    +    private static boolean parseParameters(String[] args) {
    +
    +        if(args.length > 0) {
    +            if(args.length != 3) {
    +                System.err.println("Usage: PageRank <input edges path> <output path> <num iterations>");
    --- End diff --
    
    It's no longer Page Rank ;) let's not confuse people!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/765#issuecomment-140314739
  
    Hi @mfahimazizi,
    
    as I said, it is not possible to access the edge values inside the `VertexUpdateFunction`.
    However, you can get the same functionality if you add the edge value inside the message that you create in the `MessagingFunction`. Alternatively, you can build your own delta iteration, instead of using the vertex-centric model.
    If you're planning to finish this PR, then let us know and we can even sync on skype or so to help you out! Otherwise, please close this PR and hopefully someone else will pick up this issue.
    Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/765#issuecomment-140669850
  
    Hey @mfahimazizi, thank you for your reply!
    Could you give me some details on what you mean by "calculate edges values"? What is the computation that you want to do and why can't you include it in the messaging function? Do you want to update the edge values or simply read them?
    Let me know if I can help :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by andralungu <gi...@git.apache.org>.
Github user andralungu commented on the pull request:

    https://github.com/apache/flink/pull/765#issuecomment-108634942
  
    Hi @mfahimazizi, 
    
    The common practice is to prefix the commit with the Jira issue, e.g. [FLINK-20XX][gelly]. Also, you should play with your IDE's settings. Right now, indentation is performed with spaces and it should be with tabs. This is why Travis is failing. To make sure the code works do a cd flink-staging/flink-gelly and then mvn verify. A success there equals a travis success 98% of the time :)
    
    After implementing an algorithm, it's always good to check whether it works correctly by writing a test (check the test/example/ folder :) ). 
    I would also update the documentation here. In the Library method section add an entry "*HITS", perhaps with the full name of the acronym as well. 
    
    I left some comments in-line. Overall this does not look bad at all!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by hsaputra <gi...@git.apache.org>.
Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/flink/pull/765#discussion_r31596508
  
    --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/example/HITSExample.java ---
    @@ -0,0 +1,113 @@
    +package org.apache.flink.graph.example;
    --- End diff --
    
    Missing Apache license header


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by mfahimazizi <gi...@git.apache.org>.
Github user mfahimazizi closed the pull request at:

    https://github.com/apache/flink/pull/765


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by hsaputra <gi...@git.apache.org>.
Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/flink/pull/765#discussion_r31596533
  
    --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/utils/Hits.java ---
    @@ -0,0 +1,14 @@
    +package org.apache.flink.graph.utils;
    --- End diff --
    
    Missing Apache license header


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Hits

Posted by andralungu <gi...@git.apache.org>.
Github user andralungu commented on the pull request:

    https://github.com/apache/flink/pull/765#issuecomment-130016029
  
    @vasia , I guess the guys already finished the IMPRO3 course. 
    
    What is the status of this PR @mfahimazizi? Do you have time to address the comments? Are you still interested in contributing? 
    
    We would be happy to answer your questions :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---