You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Miguel Coimbra <mi...@gmail.com> on 2017/01/20 18:31:49 UTC

Apache Flink 1.1.4 - Gelly - LocalClusteringCoefficient - Returning values above 1?

Hello,

In the documentation of the LocalClusteringCoefficient algorithm, it is
said:


*The local clustering coefficient measures the connectedness of each
vertex’s neighborhood.Scores range from 0.0 (no edges between neighbors) to
1.0 (neighborhood is a clique).*

https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/batch/libs/
gelly.html#local-clustering-coefficient
<https://ci.apache.org/projects/flink/flink-docs-master/dev/libs/gelly/library_methods.html#local-clustering-coefficient>

However, upon running the algorithm (undirected version), I obtained values
above 1.

The result I got was this. As you can see, vertex 5113 has a score of 11:
(the input edges for the graph are shown further below - around *35 edges*):

(4907,(1,0))
*(5113,(11,0))*
(6008,(0,0))
(6064,(1,0))
(6065,(1,0))
(6107,(0,0))
(6192,(0,0))
(6252,(1,0))
(6279,(1,0))
(6465,(1,0))
(6545,(0,0))
(6707,(1,0))
(6715,(1,0))
(6774,(0,0))
(7088,(0,0))
(7089,(1,0))
(7171,(0,0))
(7172,(1,0))
(7763,(0,0))
(7976,(1,0))
(8056,(1,0))
(9748,(1,0))
(10191,(1,0))
(10370,(1,0))
(10371,(1,0))
(14310,(1,0))
(16785,(1,0))
(19801,(1,0))
(26284,(1,0))
(26562,(0,0))
(31724,(1,0))
(32443,(1,0))
(32938,(0,0))
(33855,(1,0))
(37929,(0,0))

This was from a small isolated test with these edges:

5113    6008
5113    6774
5113    32938
5113    6545
5113    7088
5113    37929
5113    26562
5113    6107
5113    7171
5113    6192
5113    7763
9748    5113
10191    5113
6064    5113
6065    5113
6279    5113
4907    5113
6465    5113
6707    5113
7089    5113
7172    5113
14310    5113
6252    5113
33855    5113
7976    5113
26284    5113
8056    5113
10371    5113
16785    5113
19801    5113
6715    5113
31724    5113
32443    5113
10370    5113

I am not sure what I may be doing wrong, but is there perhaps some form of
normalization lacking in my execution of:

org.apache.flink.graph.library.clustering.undirected.LocalClusteringCoefficient.Result;
org.apache.flink.graph.library.clustering.undirected.LocalClusteringCoefficient;

Am I supposed to divide all scores by the greatest score obtained by the
algorithm?

Thank you very much!

Miguel E. Coimbra
Email: miguel.e.coimbra@gmail.com <mi...@ist.utl.pt>
Skype: miguel.e.coimbra

Re: Apache Flink 1.1.4 - Gelly - LocalClusteringCoefficient - Returning values above 1?

Posted by Greg Hogan <co...@greghogan.com>.
Miguel and Vasia,

My thought is to change the example drivers to "print" verbose strings to
the console, for example:
Vertex ID: 0, vertex degree: 42, triangle count: 7, local clustering
coefficient: 0.00406504

Output to CSV will still be the compact tuple representations which do not
include derived scores.

Also, getUndirected only creates a flipped duplicate of each edge. There
are directed and undirected "Simplify" algorithms that also remove the
duplicates.

Greg

On Mon, Jan 23, 2017 at 5:40 AM, Vasiliki Kalavri <vasilikikalavri@gmail.com
> wrote:

> Hi Miguel,
>
> I don't think you're doing anything wrong. The NaN values you are getting
> are there because of your data. The LCC value is computed as
> #number_of_triangles / #number_of_triples, where #number_of_triples is
> [n*(n-1)]/2 for a vertex with n neighbors. It looks like there are no
> triangles in your graph, and only one vertex has more than one neighbor.
>
> Cheers,
> -Vasia.
>
> On 21 January 2017 at 16:46, Miguel Coimbra <mi...@gmail.com>
> wrote:
>
>> Hello Vasia and Greg,
>>
>> Thank you for the feedback!
>>
>> I am probably misusing the Gelly API in some way, but I thought I could
>> run the undirected version after calling getUndirected()?
>> While not going into the concept of local clustering coefficients, I
>> thought that from a Gelly API point-of-view, both my code and data set were
>> properly established.
>> However:
>> - I believe that the graph was already undirected;
>> - I am getting NaN results after executing the algorithm.
>>
>> This is the code I am using to obtain an (undirected) graph instance upon
>> which I call LocalClusteringCoefficient:
>>
>>
>> import org.apache.flink.graph.library.clustering.undirected.LocalCl
>> usteringCoefficient.Result;
>> import org.apache.flink.graph.library.clustering.undirected.LocalCl
>> usteringCoefficient;
>> /** other imports and method definitions **/
>>
>> // Generate edge tuples from the input file.
>> final DataSet<Tuple2<LongValue, LongValue>> edgeTuples =
>> env.readCsvFile(inputPath)
>>     .fieldDelimiter("\t") // node IDs are separated by spaces
>>     .ignoreComments("#")  // comments start with "%"
>>     .types(LongValue.class, LongValue.class);
>>
>> // Generate actual Edge<Long, Double> instances.
>> @SuppressWarnings("serial")
>> final DataSet<Edge<LongValue, Double>> edges = edgeTuples.map(
>>     new MapFunction<Tuple2<LongValue, LongValue>, Edge<LongValue,
>> Double>>() {
>>         @Override
>>         public Edge<LongValue, Double> map(Tuple2<LongValue, LongValue>
>> arg0) throws Exception {
>>             return new Edge<LongValue, Double>(arg0.f0,  arg0.f1, 1.0d);
>>         }
>>     });
>>
>> // Generate the basic graph.
>> @SuppressWarnings("serial")
>> final Graph<LongValue, Double, Double> graph = Graph.fromDataSet(
>>     edges,
>>     new MapFunction<LongValue, Double>() {
>>         @Override
>>         public Double map(LongValue arg0) throws Exception {
>>             // For testing purposes, just setting each vertex value to
>> 1.0.
>>             return 1.0;
>>         }
>>     },
>>     env).*getUndirected(*);
>>
>> // Execute the LocalClusteringCoefficient algorithm.
>> final DataSet<Result<LongValue>> localClusteringCoefficients =
>> graph.run(new LocalClusteringCoefficient<LongValue, Double, Double>());
>>
>> // Get the values as per Vasia's help:
>> @SuppressWarnings("serial")
>> DataSet<Double> *CLUSTERING_COEFFICIENTS* =
>> localClusteringCoefficients.map(new MapFunction<Result<LongValue>,
>> Double>() {
>>     @Override
>>     public Double map(Result<LongValue> arg0) throws Exception {
>>         return arg0.getLocalClusteringCoefficientScore();
>>     }
>> });
>>
>> I believe this is the correct way to get a DataSet<Double> of
>> coefficients from a DataSet<Result<LongValue>> ?
>> Among the coefficients are a lot of NaN values:
>>
>> *CLUSTERING_COEFFICIENTS*.print();
>>
>> NaN
>> 0.0
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>> NaN
>>
>> Apologies for the verbosity in advance, but just to provide detail,
>> printing the graph edges yields this (notice that for each pair or vertices
>> there are two links, which are original and the reverse version derived
>> from getUndirected()).
>>
>> *Greg:* I therefore believe the *graph is undirected*:
>>
>> graph.getEdgesAsTuple3().print();
>> (5113,6008,1.0)
>> (6008,5113,1.0)
>> (5113,6774,1.0)
>> (6774,5113,1.0)
>> (5113,32938,1.0)
>> (32938,5113,1.0)
>> (5113,6545,1.0)
>> (6545,5113,1.0)
>> (5113,7088,1.0)
>> (7088,5113,1.0)
>> (5113,37929,1.0)
>> (37929,5113,1.0)
>> (5113,26562,1.0)
>> (26562,5113,1.0)
>> (5113,6107,1.0)
>> (6107,5113,1.0)
>> (5113,7171,1.0)
>> (7171,5113,1.0)
>> (5113,6192,1.0)
>> (6192,5113,1.0)
>> (5113,7763,1.0)
>> (7763,5113,1.0)
>> (9748,5113,1.0)
>> (5113,9748,1.0)
>> (10191,5113,1.0)
>> (5113,10191,1.0)
>> (6064,5113,1.0)
>> (5113,6064,1.0)
>> (6065,5113,1.0)
>> (5113,6065,1.0)
>> (6279,5113,1.0)
>> (5113,6279,1.0)
>> (4907,5113,1.0)
>> (5113,4907,1.0)
>> (6465,5113,1.0)
>> (5113,6465,1.0)
>> (6707,5113,1.0)
>> (5113,6707,1.0)
>> (7089,5113,1.0)
>> (5113,7089,1.0)
>> (7172,5113,1.0)
>> (5113,7172,1.0)
>> (14310,5113,1.0)
>> (5113,14310,1.0)
>> (6252,5113,1.0)
>> (5113,6252,1.0)
>> (33855,5113,1.0)
>> (5113,33855,1.0)
>> (7976,5113,1.0)
>> (5113,7976,1.0)
>> (26284,5113,1.0)
>> (5113,26284,1.0)
>> (8056,5113,1.0)
>> (5113,8056,1.0)
>> (10371,5113,1.0)
>> (5113,10371,1.0)
>> (16785,5113,1.0)
>> (5113,16785,1.0)
>> (19801,5113,1.0)
>> (5113,19801,1.0)
>> (6715,5113,1.0)
>> (5113,6715,1.0)
>> (31724,5113,1.0)
>> (5113,31724,1.0)
>> (32443,5113,1.0)
>> (5113,32443,1.0)
>> (10370,5113,1.0)
>> (5113,10370,1.0)
>>
>> Any insight into what I may be doing wrong would be greatly appreciated.
>>
>> Thanks for your time,
>>
>> Kind regards,
>>
>> Miguel E. Coimbra
>> Email: miguel.e.coimbra@gmail.com <mi...@ist.utl.pt>
>> Skype: miguel.e.coimbra
>>
>> On 20 January 2017 at 19:31, Greg Hogan <co...@greghogan.com> wrote:
>>
>>> Hi Miguel,
>>>
>>> The '--output print' option describes the values and also displays the
>>> local clustering coefficient value.
>>>
>>> You're running the undirected algorithm on a directed graph. In 1.2
>>> there is an option '--simplify true' that will add reverse edges and remove
>>> duplicate edges and self-loops. Alternatively, it looks like you could
>>> simply add reverse edges to your input file (with an optional ' | sort |
>>> uniq' following):
>>>
>>> $ cat edges.txt | awk ' { print $1, $2; print $2, $1 } '
>>>
>>> The drivers are being reworked for 1.3 to better reuse code and options
>>> which will better support additional drivers and algorithms and make
>>> documentation simpler.
>>>
>>> Greg
>>>
>>> On Fri, Jan 20, 2017 at 2:06 PM, Vasiliki Kalavri <
>>> vasilikikalavri@gmail.com> wrote:
>>>
>>>> Hi Miguel,
>>>>
>>>> the LocalClusteringCoefficient algorithm returns a DataSet of type
>>>> Result, which basically wraps a vertex id, its degree, and the number
>>>> of triangles containing this vertex. The number 11 you see is indeed the
>>>> degree of vertex 5113. The Result type contains the method
>>>> getLocalClusteringCoefficientScore() which allows you to retrieve the
>>>> clustering coefficient score for a vertex. The method simply divides the
>>>> numbers of triangles by the number of potential edges between neighbors.
>>>>
>>>> I'm sorry that you this is not clear in the docs. We should definitely
>>>> improve them to explain what is the output and how to retrieve the actual
>>>> clustering coefficient values. I have opened a JIRA for this [1].
>>>>
>>>> Cheers,
>>>> -Vasia.
>>>>
>>>> [1]: https://issues.apache.org/jira/browse/FLINK-5597
>>>>
>>>> On 20 January 2017 at 19:31, Miguel Coimbra <miguel.e.coimbra@gmail.com
>>>> > wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> In the documentation of the LocalClusteringCoefficient algorithm, it
>>>>> is said:
>>>>>
>>>>>
>>>>> *The local clustering coefficient measures the connectedness of each
>>>>> vertex’s neighborhood.Scores range from 0.0 (no edges between neighbors) to
>>>>> 1.0 (neighborhood is a clique).*
>>>>>
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.1/
>>>>> apis/batch/libs/gelly.html#local-clustering-coefficient
>>>>> <https://ci.apache.org/projects/flink/flink-docs-master/dev/libs/gelly/library_methods.html#local-clustering-coefficient>
>>>>>
>>>>> However, upon running the algorithm (undirected version), I obtained
>>>>> values above 1.
>>>>>
>>>>> The result I got was this. As you can see, vertex 5113 has a score of
>>>>> 11:
>>>>> (the input edges for the graph are shown further below - around *35
>>>>> edges*):
>>>>>
>>>>> (4907,(1,0))
>>>>> *(5113,(11,0))*
>>>>> (6008,(0,0))
>>>>> (6064,(1,0))
>>>>> (6065,(1,0))
>>>>> (6107,(0,0))
>>>>> (6192,(0,0))
>>>>> (6252,(1,0))
>>>>> (6279,(1,0))
>>>>> (6465,(1,0))
>>>>> (6545,(0,0))
>>>>> (6707,(1,0))
>>>>> (6715,(1,0))
>>>>> (6774,(0,0))
>>>>> (7088,(0,0))
>>>>> (7089,(1,0))
>>>>> (7171,(0,0))
>>>>> (7172,(1,0))
>>>>> (7763,(0,0))
>>>>> (7976,(1,0))
>>>>> (8056,(1,0))
>>>>> (9748,(1,0))
>>>>> (10191,(1,0))
>>>>> (10370,(1,0))
>>>>> (10371,(1,0))
>>>>> (14310,(1,0))
>>>>> (16785,(1,0))
>>>>> (19801,(1,0))
>>>>> (26284,(1,0))
>>>>> (26562,(0,0))
>>>>> (31724,(1,0))
>>>>> (32443,(1,0))
>>>>> (32938,(0,0))
>>>>> (33855,(1,0))
>>>>> (37929,(0,0))
>>>>>
>>>>> This was from a small isolated test with these edges:
>>>>>
>>>>> 5113    6008
>>>>> 5113    6774
>>>>> 5113    32938
>>>>> 5113    6545
>>>>> 5113    7088
>>>>> 5113    37929
>>>>> 5113    26562
>>>>> 5113    6107
>>>>> 5113    7171
>>>>> 5113    6192
>>>>> 5113    7763
>>>>> 9748    5113
>>>>> 10191    5113
>>>>> 6064    5113
>>>>> 6065    5113
>>>>> 6279    5113
>>>>> 4907    5113
>>>>> 6465    5113
>>>>> 6707    5113
>>>>> 7089    5113
>>>>> 7172    5113
>>>>> 14310    5113
>>>>> 6252    5113
>>>>> 33855    5113
>>>>> 7976    5113
>>>>> 26284    5113 <262%20845%20113>
>>>>> 8056    5113
>>>>> 10371    5113
>>>>> 16785    5113
>>>>> 19801    5113
>>>>> 6715    5113
>>>>> 31724    5113
>>>>> 32443    5113
>>>>> 10370    5113
>>>>>
>>>>> I am not sure what I may be doing wrong, but is there perhaps some
>>>>> form of normalization lacking in my execution of:
>>>>>
>>>>> org.apache.flink.graph.library.clustering.undirected.LocalCl
>>>>> usteringCoefficient.Result;
>>>>> org.apache.flink.graph.library.clustering.undirected.LocalCl
>>>>> usteringCoefficient;
>>>>>
>>>>> Am I supposed to divide all scores by the greatest score obtained by
>>>>> the algorithm?
>>>>>
>>>>> Thank you very much!
>>>>>
>>>>> Miguel E. Coimbra
>>>>> Email: miguel.e.coimbra@gmail.com <mi...@ist.utl.pt>
>>>>> Skype: miguel.e.coimbra
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Apache Flink 1.1.4 - Gelly - LocalClusteringCoefficient - Returning values above 1?

Posted by Vasiliki Kalavri <va...@gmail.com>.
Hi Miguel,

I don't think you're doing anything wrong. The NaN values you are getting
are there because of your data. The LCC value is computed as
#number_of_triangles / #number_of_triples, where #number_of_triples is
[n*(n-1)]/2 for a vertex with n neighbors. It looks like there are no
triangles in your graph, and only one vertex has more than one neighbor.

Cheers,
-Vasia.

On 21 January 2017 at 16:46, Miguel Coimbra <mi...@gmail.com>
wrote:

> Hello Vasia and Greg,
>
> Thank you for the feedback!
>
> I am probably misusing the Gelly API in some way, but I thought I could
> run the undirected version after calling getUndirected()?
> While not going into the concept of local clustering coefficients, I
> thought that from a Gelly API point-of-view, both my code and data set were
> properly established.
> However:
> - I believe that the graph was already undirected;
> - I am getting NaN results after executing the algorithm.
>
> This is the code I am using to obtain an (undirected) graph instance upon
> which I call LocalClusteringCoefficient:
>
>
> import org.apache.flink.graph.library.clustering.undirected.
> LocalClusteringCoefficient.Result;
> import org.apache.flink.graph.library.clustering.undirected.
> LocalClusteringCoefficient;
> /** other imports and method definitions **/
>
> // Generate edge tuples from the input file.
> final DataSet<Tuple2<LongValue, LongValue>> edgeTuples =
> env.readCsvFile(inputPath)
>     .fieldDelimiter("\t") // node IDs are separated by spaces
>     .ignoreComments("#")  // comments start with "%"
>     .types(LongValue.class, LongValue.class);
>
> // Generate actual Edge<Long, Double> instances.
> @SuppressWarnings("serial")
> final DataSet<Edge<LongValue, Double>> edges = edgeTuples.map(
>     new MapFunction<Tuple2<LongValue, LongValue>, Edge<LongValue,
> Double>>() {
>         @Override
>         public Edge<LongValue, Double> map(Tuple2<LongValue, LongValue>
> arg0) throws Exception {
>             return new Edge<LongValue, Double>(arg0.f0,  arg0.f1, 1.0d);
>         }
>     });
>
> // Generate the basic graph.
> @SuppressWarnings("serial")
> final Graph<LongValue, Double, Double> graph = Graph.fromDataSet(
>     edges,
>     new MapFunction<LongValue, Double>() {
>         @Override
>         public Double map(LongValue arg0) throws Exception {
>             // For testing purposes, just setting each vertex value to 1.0.
>             return 1.0;
>         }
>     },
>     env).*getUndirected(*);
>
> // Execute the LocalClusteringCoefficient algorithm.
> final DataSet<Result<LongValue>> localClusteringCoefficients =
> graph.run(new LocalClusteringCoefficient<LongValue, Double, Double>());
>
> // Get the values as per Vasia's help:
> @SuppressWarnings("serial")
> DataSet<Double> *CLUSTERING_COEFFICIENTS* = localClusteringCoefficients.map(new
> MapFunction<Result<LongValue>, Double>() {
>     @Override
>     public Double map(Result<LongValue> arg0) throws Exception {
>         return arg0.getLocalClusteringCoefficientScore();
>     }
> });
>
> I believe this is the correct way to get a DataSet<Double> of
> coefficients from a DataSet<Result<LongValue>> ?
> Among the coefficients are a lot of NaN values:
>
> *CLUSTERING_COEFFICIENTS*.print();
>
> NaN
> 0.0
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
> NaN
>
> Apologies for the verbosity in advance, but just to provide detail,
> printing the graph edges yields this (notice that for each pair or vertices
> there are two links, which are original and the reverse version derived
> from getUndirected()).
>
> *Greg:* I therefore believe the *graph is undirected*:
>
> graph.getEdgesAsTuple3().print();
> (5113,6008,1.0)
> (6008,5113,1.0)
> (5113,6774,1.0)
> (6774,5113,1.0)
> (5113,32938,1.0)
> (32938,5113,1.0)
> (5113,6545,1.0)
> (6545,5113,1.0)
> (5113,7088,1.0)
> (7088,5113,1.0)
> (5113,37929,1.0)
> (37929,5113,1.0)
> (5113,26562,1.0)
> (26562,5113,1.0)
> (5113,6107,1.0)
> (6107,5113,1.0)
> (5113,7171,1.0)
> (7171,5113,1.0)
> (5113,6192,1.0)
> (6192,5113,1.0)
> (5113,7763,1.0)
> (7763,5113,1.0)
> (9748,5113,1.0)
> (5113,9748,1.0)
> (10191,5113,1.0)
> (5113,10191,1.0)
> (6064,5113,1.0)
> (5113,6064,1.0)
> (6065,5113,1.0)
> (5113,6065,1.0)
> (6279,5113,1.0)
> (5113,6279,1.0)
> (4907,5113,1.0)
> (5113,4907,1.0)
> (6465,5113,1.0)
> (5113,6465,1.0)
> (6707,5113,1.0)
> (5113,6707,1.0)
> (7089,5113,1.0)
> (5113,7089,1.0)
> (7172,5113,1.0)
> (5113,7172,1.0)
> (14310,5113,1.0)
> (5113,14310,1.0)
> (6252,5113,1.0)
> (5113,6252,1.0)
> (33855,5113,1.0)
> (5113,33855,1.0)
> (7976,5113,1.0)
> (5113,7976,1.0)
> (26284,5113,1.0)
> (5113,26284,1.0)
> (8056,5113,1.0)
> (5113,8056,1.0)
> (10371,5113,1.0)
> (5113,10371,1.0)
> (16785,5113,1.0)
> (5113,16785,1.0)
> (19801,5113,1.0)
> (5113,19801,1.0)
> (6715,5113,1.0)
> (5113,6715,1.0)
> (31724,5113,1.0)
> (5113,31724,1.0)
> (32443,5113,1.0)
> (5113,32443,1.0)
> (10370,5113,1.0)
> (5113,10370,1.0)
>
> Any insight into what I may be doing wrong would be greatly appreciated.
>
> Thanks for your time,
>
> Kind regards,
>
> Miguel E. Coimbra
> Email: miguel.e.coimbra@gmail.com <mi...@ist.utl.pt>
> Skype: miguel.e.coimbra
>
> On 20 January 2017 at 19:31, Greg Hogan <co...@greghogan.com> wrote:
>
>> Hi Miguel,
>>
>> The '--output print' option describes the values and also displays the
>> local clustering coefficient value.
>>
>> You're running the undirected algorithm on a directed graph. In 1.2 there
>> is an option '--simplify true' that will add reverse edges and remove
>> duplicate edges and self-loops. Alternatively, it looks like you could
>> simply add reverse edges to your input file (with an optional ' | sort |
>> uniq' following):
>>
>> $ cat edges.txt | awk ' { print $1, $2; print $2, $1 } '
>>
>> The drivers are being reworked for 1.3 to better reuse code and options
>> which will better support additional drivers and algorithms and make
>> documentation simpler.
>>
>> Greg
>>
>> On Fri, Jan 20, 2017 at 2:06 PM, Vasiliki Kalavri <
>> vasilikikalavri@gmail.com> wrote:
>>
>>> Hi Miguel,
>>>
>>> the LocalClusteringCoefficient algorithm returns a DataSet of type
>>> Result, which basically wraps a vertex id, its degree, and the number
>>> of triangles containing this vertex. The number 11 you see is indeed the
>>> degree of vertex 5113. The Result type contains the method
>>> getLocalClusteringCoefficientScore() which allows you to retrieve the
>>> clustering coefficient score for a vertex. The method simply divides the
>>> numbers of triangles by the number of potential edges between neighbors.
>>>
>>> I'm sorry that you this is not clear in the docs. We should definitely
>>> improve them to explain what is the output and how to retrieve the actual
>>> clustering coefficient values. I have opened a JIRA for this [1].
>>>
>>> Cheers,
>>> -Vasia.
>>>
>>> [1]: https://issues.apache.org/jira/browse/FLINK-5597
>>>
>>> On 20 January 2017 at 19:31, Miguel Coimbra <mi...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> In the documentation of the LocalClusteringCoefficient algorithm, it
>>>> is said:
>>>>
>>>>
>>>> *The local clustering coefficient measures the connectedness of each
>>>> vertex’s neighborhood.Scores range from 0.0 (no edges between neighbors) to
>>>> 1.0 (neighborhood is a clique).*
>>>>
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.1/
>>>> apis/batch/libs/gelly.html#local-clustering-coefficient
>>>> <https://ci.apache.org/projects/flink/flink-docs-master/dev/libs/gelly/library_methods.html#local-clustering-coefficient>
>>>>
>>>> However, upon running the algorithm (undirected version), I obtained
>>>> values above 1.
>>>>
>>>> The result I got was this. As you can see, vertex 5113 has a score of
>>>> 11:
>>>> (the input edges for the graph are shown further below - around *35
>>>> edges*):
>>>>
>>>> (4907,(1,0))
>>>> *(5113,(11,0))*
>>>> (6008,(0,0))
>>>> (6064,(1,0))
>>>> (6065,(1,0))
>>>> (6107,(0,0))
>>>> (6192,(0,0))
>>>> (6252,(1,0))
>>>> (6279,(1,0))
>>>> (6465,(1,0))
>>>> (6545,(0,0))
>>>> (6707,(1,0))
>>>> (6715,(1,0))
>>>> (6774,(0,0))
>>>> (7088,(0,0))
>>>> (7089,(1,0))
>>>> (7171,(0,0))
>>>> (7172,(1,0))
>>>> (7763,(0,0))
>>>> (7976,(1,0))
>>>> (8056,(1,0))
>>>> (9748,(1,0))
>>>> (10191,(1,0))
>>>> (10370,(1,0))
>>>> (10371,(1,0))
>>>> (14310,(1,0))
>>>> (16785,(1,0))
>>>> (19801,(1,0))
>>>> (26284,(1,0))
>>>> (26562,(0,0))
>>>> (31724,(1,0))
>>>> (32443,(1,0))
>>>> (32938,(0,0))
>>>> (33855,(1,0))
>>>> (37929,(0,0))
>>>>
>>>> This was from a small isolated test with these edges:
>>>>
>>>> 5113    6008
>>>> 5113    6774
>>>> 5113    32938
>>>> 5113    6545
>>>> 5113    7088
>>>> 5113    37929
>>>> 5113    26562
>>>> 5113    6107
>>>> 5113    7171
>>>> 5113    6192
>>>> 5113    7763
>>>> 9748    5113
>>>> 10191    5113
>>>> 6064    5113
>>>> 6065    5113
>>>> 6279    5113
>>>> 4907    5113
>>>> 6465    5113
>>>> 6707    5113
>>>> 7089    5113
>>>> 7172    5113
>>>> 14310    5113
>>>> 6252    5113
>>>> 33855    5113
>>>> 7976    5113
>>>> 26284    5113 <262%20845%20113>
>>>> 8056    5113
>>>> 10371    5113
>>>> 16785    5113
>>>> 19801    5113
>>>> 6715    5113
>>>> 31724    5113
>>>> 32443    5113
>>>> 10370    5113
>>>>
>>>> I am not sure what I may be doing wrong, but is there perhaps some form
>>>> of normalization lacking in my execution of:
>>>>
>>>> org.apache.flink.graph.library.clustering.undirected.LocalCl
>>>> usteringCoefficient.Result;
>>>> org.apache.flink.graph.library.clustering.undirected.LocalCl
>>>> usteringCoefficient;
>>>>
>>>> Am I supposed to divide all scores by the greatest score obtained by
>>>> the algorithm?
>>>>
>>>> Thank you very much!
>>>>
>>>> Miguel E. Coimbra
>>>> Email: miguel.e.coimbra@gmail.com <mi...@ist.utl.pt>
>>>> Skype: miguel.e.coimbra
>>>>
>>>
>>>
>>
>

Re: Apache Flink 1.1.4 - Gelly - LocalClusteringCoefficient - Returning values above 1?

Posted by Miguel Coimbra <mi...@gmail.com>.
Hello Vasia and Greg,

Thank you for the feedback!

I am probably misusing the Gelly API in some way, but I thought I could run
the undirected version after calling getUndirected()?
While not going into the concept of local clustering coefficients, I
thought that from a Gelly API point-of-view, both my code and data set were
properly established.
However:
- I believe that the graph was already undirected;
- I am getting NaN results after executing the algorithm.

This is the code I am using to obtain an (undirected) graph instance upon
which I call LocalClusteringCoefficient:


import
org.apache.flink.graph.library.clustering.undirected.LocalClusteringCoefficient.Result;
import
org.apache.flink.graph.library.clustering.undirected.LocalClusteringCoefficient;
/** other imports and method definitions **/

// Generate edge tuples from the input file.
final DataSet<Tuple2<LongValue, LongValue>> edgeTuples =
env.readCsvFile(inputPath)
    .fieldDelimiter("\t") // node IDs are separated by spaces
    .ignoreComments("#")  // comments start with "%"
    .types(LongValue.class, LongValue.class);

// Generate actual Edge<Long, Double> instances.
@SuppressWarnings("serial")
final DataSet<Edge<LongValue, Double>> edges = edgeTuples.map(
    new MapFunction<Tuple2<LongValue, LongValue>, Edge<LongValue,
Double>>() {
        @Override
        public Edge<LongValue, Double> map(Tuple2<LongValue, LongValue>
arg0) throws Exception {
            return new Edge<LongValue, Double>(arg0.f0,  arg0.f1, 1.0d);
        }
    });

// Generate the basic graph.
@SuppressWarnings("serial")
final Graph<LongValue, Double, Double> graph = Graph.fromDataSet(
    edges,
    new MapFunction<LongValue, Double>() {
        @Override
        public Double map(LongValue arg0) throws Exception {
            // For testing purposes, just setting each vertex value to 1.0.
            return 1.0;
        }
    },
    env).*getUndirected(*);

// Execute the LocalClusteringCoefficient algorithm.
final DataSet<Result<LongValue>> localClusteringCoefficients =
graph.run(new LocalClusteringCoefficient<LongValue, Double, Double>());

// Get the values as per Vasia's help:
@SuppressWarnings("serial")
DataSet<Double> *CLUSTERING_COEFFICIENTS* =
localClusteringCoefficients.map(new MapFunction<Result<LongValue>,
Double>() {
    @Override
    public Double map(Result<LongValue> arg0) throws Exception {
        return arg0.getLocalClusteringCoefficientScore();
    }
});

I believe this is the correct way to get a DataSet<Double> of coefficients
from a DataSet<Result<LongValue>> ?
Among the coefficients are a lot of NaN values:

*CLUSTERING_COEFFICIENTS*.print();

NaN
0.0
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN

Apologies for the verbosity in advance, but just to provide detail,
printing the graph edges yields this (notice that for each pair or vertices
there are two links, which are original and the reverse version derived
from getUndirected()).

*Greg:* I therefore believe the *graph is undirected*:

graph.getEdgesAsTuple3().print();
(5113,6008,1.0)
(6008,5113,1.0)
(5113,6774,1.0)
(6774,5113,1.0)
(5113,32938,1.0)
(32938,5113,1.0)
(5113,6545,1.0)
(6545,5113,1.0)
(5113,7088,1.0)
(7088,5113,1.0)
(5113,37929,1.0)
(37929,5113,1.0)
(5113,26562,1.0)
(26562,5113,1.0)
(5113,6107,1.0)
(6107,5113,1.0)
(5113,7171,1.0)
(7171,5113,1.0)
(5113,6192,1.0)
(6192,5113,1.0)
(5113,7763,1.0)
(7763,5113,1.0)
(9748,5113,1.0)
(5113,9748,1.0)
(10191,5113,1.0)
(5113,10191,1.0)
(6064,5113,1.0)
(5113,6064,1.0)
(6065,5113,1.0)
(5113,6065,1.0)
(6279,5113,1.0)
(5113,6279,1.0)
(4907,5113,1.0)
(5113,4907,1.0)
(6465,5113,1.0)
(5113,6465,1.0)
(6707,5113,1.0)
(5113,6707,1.0)
(7089,5113,1.0)
(5113,7089,1.0)
(7172,5113,1.0)
(5113,7172,1.0)
(14310,5113,1.0)
(5113,14310,1.0)
(6252,5113,1.0)
(5113,6252,1.0)
(33855,5113,1.0)
(5113,33855,1.0)
(7976,5113,1.0)
(5113,7976,1.0)
(26284,5113,1.0)
(5113,26284,1.0)
(8056,5113,1.0)
(5113,8056,1.0)
(10371,5113,1.0)
(5113,10371,1.0)
(16785,5113,1.0)
(5113,16785,1.0)
(19801,5113,1.0)
(5113,19801,1.0)
(6715,5113,1.0)
(5113,6715,1.0)
(31724,5113,1.0)
(5113,31724,1.0)
(32443,5113,1.0)
(5113,32443,1.0)
(10370,5113,1.0)
(5113,10370,1.0)

Any insight into what I may be doing wrong would be greatly appreciated.

Thanks for your time,

Kind regards,

Miguel E. Coimbra
Email: miguel.e.coimbra@gmail.com <mi...@ist.utl.pt>
Skype: miguel.e.coimbra

On 20 January 2017 at 19:31, Greg Hogan <co...@greghogan.com> wrote:

> Hi Miguel,
>
> The '--output print' option describes the values and also displays the
> local clustering coefficient value.
>
> You're running the undirected algorithm on a directed graph. In 1.2 there
> is an option '--simplify true' that will add reverse edges and remove
> duplicate edges and self-loops. Alternatively, it looks like you could
> simply add reverse edges to your input file (with an optional ' | sort |
> uniq' following):
>
> $ cat edges.txt | awk ' { print $1, $2; print $2, $1 } '
>
> The drivers are being reworked for 1.3 to better reuse code and options
> which will better support additional drivers and algorithms and make
> documentation simpler.
>
> Greg
>
> On Fri, Jan 20, 2017 at 2:06 PM, Vasiliki Kalavri <
> vasilikikalavri@gmail.com> wrote:
>
>> Hi Miguel,
>>
>> the LocalClusteringCoefficient algorithm returns a DataSet of type Result,
>> which basically wraps a vertex id, its degree, and the number of triangles
>> containing this vertex. The number 11 you see is indeed the degree of
>> vertex 5113. The Result type contains the method
>> getLocalClusteringCoefficientScore() which allows you to retrieve the
>> clustering coefficient score for a vertex. The method simply divides the
>> numbers of triangles by the number of potential edges between neighbors.
>>
>> I'm sorry that you this is not clear in the docs. We should definitely
>> improve them to explain what is the output and how to retrieve the actual
>> clustering coefficient values. I have opened a JIRA for this [1].
>>
>> Cheers,
>> -Vasia.
>>
>> [1]: https://issues.apache.org/jira/browse/FLINK-5597
>>
>> On 20 January 2017 at 19:31, Miguel Coimbra <mi...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> In the documentation of the LocalClusteringCoefficient algorithm, it is
>>> said:
>>>
>>>
>>> *The local clustering coefficient measures the connectedness of each
>>> vertex’s neighborhood.Scores range from 0.0 (no edges between neighbors) to
>>> 1.0 (neighborhood is a clique).*
>>>
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.1/
>>> apis/batch/libs/gelly.html#local-clustering-coefficient
>>> <https://ci.apache.org/projects/flink/flink-docs-master/dev/libs/gelly/library_methods.html#local-clustering-coefficient>
>>>
>>> However, upon running the algorithm (undirected version), I obtained
>>> values above 1.
>>>
>>> The result I got was this. As you can see, vertex 5113 has a score of
>>> 11:
>>> (the input edges for the graph are shown further below - around *35
>>> edges*):
>>>
>>> (4907,(1,0))
>>> *(5113,(11,0))*
>>> (6008,(0,0))
>>> (6064,(1,0))
>>> (6065,(1,0))
>>> (6107,(0,0))
>>> (6192,(0,0))
>>> (6252,(1,0))
>>> (6279,(1,0))
>>> (6465,(1,0))
>>> (6545,(0,0))
>>> (6707,(1,0))
>>> (6715,(1,0))
>>> (6774,(0,0))
>>> (7088,(0,0))
>>> (7089,(1,0))
>>> (7171,(0,0))
>>> (7172,(1,0))
>>> (7763,(0,0))
>>> (7976,(1,0))
>>> (8056,(1,0))
>>> (9748,(1,0))
>>> (10191,(1,0))
>>> (10370,(1,0))
>>> (10371,(1,0))
>>> (14310,(1,0))
>>> (16785,(1,0))
>>> (19801,(1,0))
>>> (26284,(1,0))
>>> (26562,(0,0))
>>> (31724,(1,0))
>>> (32443,(1,0))
>>> (32938,(0,0))
>>> (33855,(1,0))
>>> (37929,(0,0))
>>>
>>> This was from a small isolated test with these edges:
>>>
>>> 5113    6008
>>> 5113    6774
>>> 5113    32938
>>> 5113    6545
>>> 5113    7088
>>> 5113    37929
>>> 5113    26562
>>> 5113    6107
>>> 5113    7171
>>> 5113    6192
>>> 5113    7763
>>> 9748    5113
>>> 10191    5113
>>> 6064    5113
>>> 6065    5113
>>> 6279    5113
>>> 4907    5113
>>> 6465    5113
>>> 6707    5113
>>> 7089    5113
>>> 7172    5113
>>> 14310    5113
>>> 6252    5113
>>> 33855    5113
>>> 7976    5113
>>> 26284    5113 <262%20845%20113>
>>> 8056    5113
>>> 10371    5113
>>> 16785    5113
>>> 19801    5113
>>> 6715    5113
>>> 31724    5113
>>> 32443    5113
>>> 10370    5113
>>>
>>> I am not sure what I may be doing wrong, but is there perhaps some form
>>> of normalization lacking in my execution of:
>>>
>>> org.apache.flink.graph.library.clustering.undirected.LocalCl
>>> usteringCoefficient.Result;
>>> org.apache.flink.graph.library.clustering.undirected.LocalCl
>>> usteringCoefficient;
>>>
>>> Am I supposed to divide all scores by the greatest score obtained by the
>>> algorithm?
>>>
>>> Thank you very much!
>>>
>>> Miguel E. Coimbra
>>> Email: miguel.e.coimbra@gmail.com <mi...@ist.utl.pt>
>>> Skype: miguel.e.coimbra
>>>
>>
>>
>

Re: Apache Flink 1.1.4 - Gelly - LocalClusteringCoefficient - Returning values above 1?

Posted by Greg Hogan <co...@greghogan.com>.
Hi Miguel,

The '--output print' option describes the values and also displays the
local clustering coefficient value.

You're running the undirected algorithm on a directed graph. In 1.2 there
is an option '--simplify true' that will add reverse edges and remove
duplicate edges and self-loops. Alternatively, it looks like you could
simply add reverse edges to your input file (with an optional ' | sort |
uniq' following):

$ cat edges.txt | awk ' { print $1, $2; print $2, $1 } '

The drivers are being reworked for 1.3 to better reuse code and options
which will better support additional drivers and algorithms and make
documentation simpler.

Greg

On Fri, Jan 20, 2017 at 2:06 PM, Vasiliki Kalavri <vasilikikalavri@gmail.com
> wrote:

> Hi Miguel,
>
> the LocalClusteringCoefficient algorithm returns a DataSet of type Result,
> which basically wraps a vertex id, its degree, and the number of triangles
> containing this vertex. The number 11 you see is indeed the degree of
> vertex 5113. The Result type contains the method
> getLocalClusteringCoefficientScore() which allows you to retrieve the
> clustering coefficient score for a vertex. The method simply divides the
> numbers of triangles by the number of potential edges between neighbors.
>
> I'm sorry that you this is not clear in the docs. We should definitely
> improve them to explain what is the output and how to retrieve the actual
> clustering coefficient values. I have opened a JIRA for this [1].
>
> Cheers,
> -Vasia.
>
> [1]: https://issues.apache.org/jira/browse/FLINK-5597
>
> On 20 January 2017 at 19:31, Miguel Coimbra <mi...@gmail.com>
> wrote:
>
>> Hello,
>>
>> In the documentation of the LocalClusteringCoefficient algorithm, it is
>> said:
>>
>>
>> *The local clustering coefficient measures the connectedness of each
>> vertex’s neighborhood.Scores range from 0.0 (no edges between neighbors) to
>> 1.0 (neighborhood is a clique).*
>>
>> https://ci.apache.org/projects/flink/flink-docs-release-1.1/
>> apis/batch/libs/gelly.html#local-clustering-coefficient
>> <https://ci.apache.org/projects/flink/flink-docs-master/dev/libs/gelly/library_methods.html#local-clustering-coefficient>
>>
>> However, upon running the algorithm (undirected version), I obtained
>> values above 1.
>>
>> The result I got was this. As you can see, vertex 5113 has a score of 11:
>> (the input edges for the graph are shown further below - around *35
>> edges*):
>>
>> (4907,(1,0))
>> *(5113,(11,0))*
>> (6008,(0,0))
>> (6064,(1,0))
>> (6065,(1,0))
>> (6107,(0,0))
>> (6192,(0,0))
>> (6252,(1,0))
>> (6279,(1,0))
>> (6465,(1,0))
>> (6545,(0,0))
>> (6707,(1,0))
>> (6715,(1,0))
>> (6774,(0,0))
>> (7088,(0,0))
>> (7089,(1,0))
>> (7171,(0,0))
>> (7172,(1,0))
>> (7763,(0,0))
>> (7976,(1,0))
>> (8056,(1,0))
>> (9748,(1,0))
>> (10191,(1,0))
>> (10370,(1,0))
>> (10371,(1,0))
>> (14310,(1,0))
>> (16785,(1,0))
>> (19801,(1,0))
>> (26284,(1,0))
>> (26562,(0,0))
>> (31724,(1,0))
>> (32443,(1,0))
>> (32938,(0,0))
>> (33855,(1,0))
>> (37929,(0,0))
>>
>> This was from a small isolated test with these edges:
>>
>> 5113    6008
>> 5113    6774
>> 5113    32938
>> 5113    6545
>> 5113    7088
>> 5113    37929
>> 5113    26562
>> 5113    6107
>> 5113    7171
>> 5113    6192
>> 5113    7763
>> 9748    5113
>> 10191    5113
>> 6064    5113
>> 6065    5113
>> 6279    5113
>> 4907    5113
>> 6465    5113
>> 6707    5113
>> 7089    5113
>> 7172    5113
>> 14310    5113
>> 6252    5113
>> 33855    5113
>> 7976    5113
>> 26284    5113
>> 8056    5113
>> 10371    5113
>> 16785    5113
>> 19801    5113
>> 6715    5113
>> 31724    5113
>> 32443    5113
>> 10370    5113
>>
>> I am not sure what I may be doing wrong, but is there perhaps some form
>> of normalization lacking in my execution of:
>>
>> org.apache.flink.graph.library.clustering.undirected.LocalCl
>> usteringCoefficient.Result;
>> org.apache.flink.graph.library.clustering.undirected.LocalCl
>> usteringCoefficient;
>>
>> Am I supposed to divide all scores by the greatest score obtained by the
>> algorithm?
>>
>> Thank you very much!
>>
>> Miguel E. Coimbra
>> Email: miguel.e.coimbra@gmail.com <mi...@ist.utl.pt>
>> Skype: miguel.e.coimbra
>>
>
>

Re: Apache Flink 1.1.4 - Gelly - LocalClusteringCoefficient - Returning values above 1?

Posted by Vasiliki Kalavri <va...@gmail.com>.
Hi Miguel,

the LocalClusteringCoefficient algorithm returns a DataSet of type Result,
which basically wraps a vertex id, its degree, and the number of triangles
containing this vertex. The number 11 you see is indeed the degree of
vertex 5113. The Result type contains the method
getLocalClusteringCoefficientScore() which allows you to retrieve the
clustering coefficient score for a vertex. The method simply divides the
numbers of triangles by the number of potential edges between neighbors.

I'm sorry that you this is not clear in the docs. We should definitely
improve them to explain what is the output and how to retrieve the actual
clustering coefficient values. I have opened a JIRA for this [1].

Cheers,
-Vasia.

[1]: https://issues.apache.org/jira/browse/FLINK-5597

On 20 January 2017 at 19:31, Miguel Coimbra <mi...@gmail.com>
wrote:

> Hello,
>
> In the documentation of the LocalClusteringCoefficient algorithm, it is
> said:
>
>
> *The local clustering coefficient measures the connectedness of each
> vertex’s neighborhood.Scores range from 0.0 (no edges between neighbors) to
> 1.0 (neighborhood is a clique).*
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.1/
> apis/batch/libs/gelly.html#local-clustering-coefficient
> <https://ci.apache.org/projects/flink/flink-docs-master/dev/libs/gelly/library_methods.html#local-clustering-coefficient>
>
> However, upon running the algorithm (undirected version), I obtained
> values above 1.
>
> The result I got was this. As you can see, vertex 5113 has a score of 11:
> (the input edges for the graph are shown further below - around *35 edges*
> ):
>
> (4907,(1,0))
> *(5113,(11,0))*
> (6008,(0,0))
> (6064,(1,0))
> (6065,(1,0))
> (6107,(0,0))
> (6192,(0,0))
> (6252,(1,0))
> (6279,(1,0))
> (6465,(1,0))
> (6545,(0,0))
> (6707,(1,0))
> (6715,(1,0))
> (6774,(0,0))
> (7088,(0,0))
> (7089,(1,0))
> (7171,(0,0))
> (7172,(1,0))
> (7763,(0,0))
> (7976,(1,0))
> (8056,(1,0))
> (9748,(1,0))
> (10191,(1,0))
> (10370,(1,0))
> (10371,(1,0))
> (14310,(1,0))
> (16785,(1,0))
> (19801,(1,0))
> (26284,(1,0))
> (26562,(0,0))
> (31724,(1,0))
> (32443,(1,0))
> (32938,(0,0))
> (33855,(1,0))
> (37929,(0,0))
>
> This was from a small isolated test with these edges:
>
> 5113    6008
> 5113    6774
> 5113    32938
> 5113    6545
> 5113    7088
> 5113    37929
> 5113    26562
> 5113    6107
> 5113    7171
> 5113    6192
> 5113    7763
> 9748    5113
> 10191    5113
> 6064    5113
> 6065    5113
> 6279    5113
> 4907    5113
> 6465    5113
> 6707    5113
> 7089    5113
> 7172    5113
> 14310    5113
> 6252    5113
> 33855    5113
> 7976    5113
> 26284    5113
> 8056    5113
> 10371    5113
> 16785    5113
> 19801    5113
> 6715    5113
> 31724    5113
> 32443    5113
> 10370    5113
>
> I am not sure what I may be doing wrong, but is there perhaps some form of
> normalization lacking in my execution of:
>
> org.apache.flink.graph.library.clustering.undirected.
> LocalClusteringCoefficient.Result;
> org.apache.flink.graph.library.clustering.undirected.
> LocalClusteringCoefficient;
>
> Am I supposed to divide all scores by the greatest score obtained by the
> algorithm?
>
> Thank you very much!
>
> Miguel E. Coimbra
> Email: miguel.e.coimbra@gmail.com <mi...@ist.utl.pt>
> Skype: miguel.e.coimbra
>