You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Greg Hogan <co...@greghogan.com> on 2017/02/23 20:50:25 UTC

Re: Apache Flink 1.1.4 - Gelly - LocalClusteringCoefficient - Returning values above 1?

Miguel and Vasia,

My thought is to change the example drivers to "print" verbose strings to
the console, for example:
Vertex ID: 0, vertex degree: 42, triangle count: 7, local clustering
coefficient: 0.00406504

Output to CSV will still be the compact tuple representations which do not
include derived scores.

Also, getUndirected only creates a flipped duplicate of each edge. There
are directed and undirected "Simplify" algorithms that also remove the
duplicates.

Greg

On Mon, Jan 23, 2017 at 5:40 AM, Vasiliki Kalavri <vasilikikalavri@gmail.com
> wrote:

> Hi Miguel,
>
> I don't think you're doing anything wrong. The NaN values you are getting
> are there because of your data. The LCC value is computed as
> #number_of_triangles / #number_of_triples, where #number_of_triples is
> [n*(n-1)]/2 for a vertex with n neighbors. It looks like there are no
> triangles in your graph, and only one vertex has more than one neighbor.
>
> Cheers,
> -Vasia.
>
> On 21 January 2017 at 16:46, Miguel Coimbra <mi...@gmail.com>
> wrote:
>
>> Hello Vasia and Greg,
>>
>> Thank you for the feedback!
>>
>> I am probably misusing the Gelly API in some way, but I thought I could
>> run the undirected version after calling getUndirected()?
>> While not going into the concept of local clustering coefficients, I
>> thought that from a Gelly API point-of-view, both my code and data set were
>> properly established.
>> However:
>> - I believe that the graph was already undirected;
>> - I am getting NaN results after executing the algorithm.
>>
>> This is the code I am using to obtain an (undirected) graph instance upon
>> which I call LocalClusteringCoefficient:
>>
>>
>> import org.apache.flink.graph.library.clustering.undirected.LocalCl
>> usteringCoefficient.Result;
>> import org.apache.flink.graph.library.clustering.undirected.LocalCl
>> usteringCoefficient;
>> /** other imports and method definitions **/
>>
>> // Generate edge tuples from the input file.
>> final DataSet<Tuple2<LongValue, LongValue>> edgeTuples =
>> env.readCsvFile(inputPath)
>>     .fieldDelimiter("\t") // node IDs are separated by spaces
>>     .ignoreComments("#")  // comments start with "%"
>>     .types(LongValue.class, LongValue.class);
>>
>> // Generate actual Edge<Long, Double> instances.
>> @SuppressWarnings("serial")
>> final DataSet<Edge<LongValue, Double>> edges = edgeTuples.map(
>>     new MapFunction<Tuple2<LongValue, LongValue>, Edge<LongValue,
>> Double>>() {
>>         @Override
>>         public Edge<LongValue, Double> map(Tuple2<LongValue, LongValue>
>> arg0) throws Exception {
>>             return new Edge<LongValue, Double>(arg0.f0,  arg0.f1, 1.0d);
>>         }
>>     });
>>
>> // Generate the basic graph.
>> @SuppressWarnings("serial")
>> final Graph<LongValue, Double, Double> graph = Graph.fromDataSet(
>>     edges,
>>     new MapFunction<LongValue, Double>() {
>>         @Override
>>         public Double map(LongValue arg0) throws Exception {
>>             // For testing purposes, just setting each vertex value to
>> 1.0.
>>             return 1.0;
>>         }
>>     },
>>     env).*getUndirected(*);
>>
>> // Execute the LocalClusteringCoefficient algorithm.
>> final DataSet<Result<LongValue>> localClusteringCoefficients =
>> graph.run(new LocalClusteringCoefficient<LongValue, Double, Double>());
>>
>> // Get the values as per Vasia's help:
>> @SuppressWarnings("serial")
>> DataSet<Double> *CLUSTERING_COEFFICIENTS* =
>> localClusteringCoefficients.map(new MapFunction<Result<LongValue>,
>> Double>() {
>>     @Override
>>     public Double map(Result<LongValue> arg0) throws Exception {
>>         return arg0.getLocalClusteringCoefficientScore();
>>     }
>> });
>>
>> I believe this is the correct way to get a DataSet<Double> of
>> coefficients from a DataSet<Result<LongValue>> ?
>> Among the coefficients are a lot of NaN values:
>>
>> *CLUSTERING_COEFFICIENTS*.print();
>>
>> NaN
>> 0.0
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>>
>> Apologies for the verbosity in advance, but just to provide detail,
>> printing the graph edges yields this (notice that for each pair or vertices
>> there are two links, which are original and the reverse version derived
>> from getUndirected()).
>>
>> *Greg:* I therefore believe the *graph is undirected*:
>>
>> graph.getEdgesAsTuple3().print();
>> (5113,6008,1.0)
>> (6008,5113,1.0)
>> (5113,6774,1.0)
>> (6774,5113,1.0)
>> (5113,32938,1.0)
>> (32938,5113,1.0)
>> (5113,6545,1.0)
>> (6545,5113,1.0)
>> (5113,7088,1.0)
>> (7088,5113,1.0)
>> (5113,37929,1.0)
>> (37929,5113,1.0)
>> (5113,26562,1.0)
>> (26562,5113,1.0)
>> (5113,6107,1.0)
>> (6107,5113,1.0)
>> (5113,7171,1.0)
>> (7171,5113,1.0)
>> (5113,6192,1.0)
>> (6192,5113,1.0)
>> (5113,7763,1.0)
>> (7763,5113,1.0)
>> (9748,5113,1.0)
>> (5113,9748,1.0)
>> (10191,5113,1.0)
>> (5113,10191,1.0)
>> (6064,5113,1.0)
>> (5113,6064,1.0)
>> (6065,5113,1.0)
>> (5113,6065,1.0)
>> (6279,5113,1.0)
>> (5113,6279,1.0)
>> (4907,5113,1.0)
>> (5113,4907,1.0)
>> (6465,5113,1.0)
>> (5113,6465,1.0)
>> (6707,5113,1.0)
>> (5113,6707,1.0)
>> (7089,5113,1.0)
>> (5113,7089,1.0)
>> (7172,5113,1.0)
>> (5113,7172,1.0)
>> (14310,5113,1.0)
>> (5113,14310,1.0)
>> (6252,5113,1.0)
>> (5113,6252,1.0)
>> (33855,5113,1.0)
>> (5113,33855,1.0)
>> (7976,5113,1.0)
>> (5113,7976,1.0)
>> (26284,5113,1.0)
>> (5113,26284,1.0)
>> (8056,5113,1.0)
>> (5113,8056,1.0)
>> (10371,5113,1.0)
>> (5113,10371,1.0)
>> (16785,5113,1.0)
>> (5113,16785,1.0)
>> (19801,5113,1.0)
>> (5113,19801,1.0)
>> (6715,5113,1.0)
>> (5113,6715,1.0)
>> (31724,5113,1.0)
>> (5113,31724,1.0)
>> (32443,5113,1.0)
>> (5113,32443,1.0)
>> (10370,5113,1.0)
>> (5113,10370,1.0)
>>
>> Any insight into what I may be doing wrong would be greatly appreciated.
>>
>> Thanks for your time,
>>
>> Kind regards,
>>
>> Miguel E. Coimbra
>> Email: miguel.e.coimbra@gmail.com <mi...@ist.utl.pt>
>> Skype: miguel.e.coimbra
>>
>> On 20 January 2017 at 19:31, Greg Hogan <co...@greghogan.com> wrote:
>>
>>> Hi Miguel,
>>>
>>> The '--output print' option describes the values and also displays the
>>> local clustering coefficient value.
>>>
>>> You're running the undirected algorithm on a directed graph. In 1.2
>>> there is an option '--simplify true' that will add reverse edges and remove
>>> duplicate edges and self-loops. Alternatively, it looks like you could
>>> simply add reverse edges to your input file (with an optional ' | sort |
>>> uniq' following):
>>>
>>> $ cat edges.txt | awk ' { print $1, $2; print $2, $1 } '
>>>
>>> The drivers are being reworked for 1.3 to better reuse code and options
>>> which will better support additional drivers and algorithms and make
>>> documentation simpler.
>>>
>>> Greg
>>>
>>> On Fri, Jan 20, 2017 at 2:06 PM, Vasiliki Kalavri <
>>> vasilikikalavri@gmail.com> wrote:
>>>
>>>> Hi Miguel,
>>>>
>>>> the LocalClusteringCoefficient algorithm returns a DataSet of type
>>>> Result, which basically wraps a vertex id, its degree, and the number
>>>> of triangles containing this vertex. The number 11 you see is indeed the
>>>> degree of vertex 5113. The Result type contains the method
>>>> getLocalClusteringCoefficientScore() which allows you to retrieve the
>>>> clustering coefficient score for a vertex. The method simply divides the
>>>> numbers of triangles by the number of potential edges between neighbors.
>>>>
>>>> I'm sorry that you this is not clear in the docs. We should definitely
>>>> improve them to explain what is the output and how to retrieve the actual
>>>> clustering coefficient values. I have opened a JIRA for this [1].
>>>>
>>>> Cheers,
>>>> -Vasia.
>>>>
>>>> [1]: https://issues.apache.org/jira/browse/FLINK-5597
>>>>
>>>> On 20 January 2017 at 19:31, Miguel Coimbra <miguel.e.coimbra@gmail.com
>>>> > wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> In the documentation of the LocalClusteringCoefficient algorithm, it
>>>>> is said:
>>>>>
>>>>>
>>>>> *The local clustering coefficient measures the connectedness of each
>>>>> vertex’s neighborhood.Scores range from 0.0 (no edges between neighbors) to
>>>>> 1.0 (neighborhood is a clique).*
>>>>>
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.1/
>>>>> apis/batch/libs/gelly.html#local-clustering-coefficient
>>>>> <https://ci.apache.org/projects/flink/flink-docs-master/dev/libs/gelly/library_methods.html#local-clustering-coefficient>
>>>>>
>>>>> However, upon running the algorithm (undirected version), I obtained
>>>>> values above 1.
>>>>>
>>>>> The result I got was this. As you can see, vertex 5113 has a score of
>>>>> 11:
>>>>> (the input edges for the graph are shown further below - around *35
>>>>> edges*):
>>>>>
>>>>> (4907,(1,0))
>>>>> *(5113,(11,0))*
>>>>> (6008,(0,0))
>>>>> (6064,(1,0))
>>>>> (6065,(1,0))
>>>>> (6107,(0,0))
>>>>> (6192,(0,0))
>>>>> (6252,(1,0))
>>>>> (6279,(1,0))
>>>>> (6465,(1,0))
>>>>> (6545,(0,0))
>>>>> (6707,(1,0))
>>>>> (6715,(1,0))
>>>>> (6774,(0,0))
>>>>> (7088,(0,0))
>>>>> (7089,(1,0))
>>>>> (7171,(0,0))
>>>>> (7172,(1,0))
>>>>> (7763,(0,0))
>>>>> (7976,(1,0))
>>>>> (8056,(1,0))
>>>>> (9748,(1,0))
>>>>> (10191,(1,0))
>>>>> (10370,(1,0))
>>>>> (10371,(1,0))
>>>>> (14310,(1,0))
>>>>> (16785,(1,0))
>>>>> (19801,(1,0))
>>>>> (26284,(1,0))
>>>>> (26562,(0,0))
>>>>> (31724,(1,0))
>>>>> (32443,(1,0))
>>>>> (32938,(0,0))
>>>>> (33855,(1,0))
>>>>> (37929,(0,0))
>>>>>
>>>>> This was from a small isolated test with these edges:
>>>>>
>>>>> 5113    6008
>>>>> 5113    6774
>>>>> 5113    32938
>>>>> 5113    6545
>>>>> 5113    7088
>>>>> 5113    37929
>>>>> 5113    26562
>>>>> 5113    6107
>>>>> 5113    7171
>>>>> 5113    6192
>>>>> 5113    7763
>>>>> 9748    5113
>>>>> 10191    5113
>>>>> 6064    5113
>>>>> 6065    5113
>>>>> 6279    5113
>>>>> 4907    5113
>>>>> 6465    5113
>>>>> 6707    5113
>>>>> 7089    5113
>>>>> 7172    5113
>>>>> 14310    5113
>>>>> 6252    5113
>>>>> 33855    5113
>>>>> 7976    5113
>>>>> 26284    5113 <262%20845%20113>
>>>>> 8056    5113
>>>>> 10371    5113
>>>>> 16785    5113
>>>>> 19801    5113
>>>>> 6715    5113
>>>>> 31724    5113
>>>>> 32443    5113
>>>>> 10370    5113
>>>>>
>>>>> I am not sure what I may be doing wrong, but is there perhaps some
>>>>> form of normalization lacking in my execution of:
>>>>>
>>>>> org.apache.flink.graph.library.clustering.undirected.LocalCl
>>>>> usteringCoefficient.Result;
>>>>> org.apache.flink.graph.library.clustering.undirected.LocalCl
>>>>> usteringCoefficient;
>>>>>
>>>>> Am I supposed to divide all scores by the greatest score obtained by
>>>>> the algorithm?
>>>>>
>>>>> Thank you very much!
>>>>>
>>>>> Miguel E. Coimbra
>>>>> Email: miguel.e.coimbra@gmail.com <mi...@ist.utl.pt>
>>>>> Skype: miguel.e.coimbra
>>>>>
>>>>
>>>>
>>>
>>
>