You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by greghogan <gi...@git.apache.org> on 2016/05/11 04:17:47 UTC

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

GitHub user greghogan opened a pull request:

    https://github.com/apache/flink/pull/1980

    [FLINK-3780] [gelly] Jaccard Similarity

    The Jaccard Index measures the similarity between vertex neighborhoods. Scores range from 0.0 (no common neighbors) to 1.0 (all neighbors are common).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/greghogan/flink 3780_jaccard_similarity

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1980.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1980
    
----
commit 8603c7a9a2de9f0071d8c677a36a5c300d35f40a
Author: Greg Hogan <co...@greghogan.com>
Date:   2016-05-09T18:45:15Z

    [FLINK-3780] [gelly] Jaccard Similarity
    
    The Jaccard Index measures the similarity between vertex neighborhoods.
    Scores range from 0.0 (no common neighbors) to 1.0 (all neighbors are
    common).

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63894239
  
    --- Diff: flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/similarity/JaccardIndex.java ---
    @@ -0,0 +1,462 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library.similarity;
    +
    +import org.apache.flink.api.common.ExecutionConfig;
    +import org.apache.flink.api.common.functions.FlatMapFunction;
    +import org.apache.flink.api.common.functions.GroupReduceFunction;
    +import org.apache.flink.api.common.operators.Order;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.functions.FunctionAnnotation;
    +import org.apache.flink.api.java.tuple.Tuple2;
    +import org.apache.flink.api.java.tuple.Tuple3;
    +import org.apache.flink.api.java.tuple.Tuple4;
    +import org.apache.flink.graph.Edge;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.GraphAlgorithm;
    +import org.apache.flink.graph.asm.degree.annotate.undirected.EdgeTargetDegree;
    +import org.apache.flink.graph.library.similarity.JaccardIndex.Result;
    +import org.apache.flink.graph.utils.Murmur3_32;
    +import org.apache.flink.types.CopyableValue;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.util.Collector;
    +import org.apache.flink.util.Preconditions;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +/**
    + * The Jaccard Index measures the similarity between vertex neighborhoods.
    + * Scores range from 0.0 (no common neighbors) to 1.0 (all neighbors are common).
    + * <br/>
    + * This implementation produces similarity scores for each pair of vertices
    + * in the graph with at least one common neighbor; equivalently, this is the
    + * set of all non-zero Jaccard Similarity coefficients.
    + * <br/>
    + * The input graph must be a simple, undirected graph containing no duplicate
    + * edges or self-loops.
    + *
    + * @param <K> graph ID type
    + * @param <VV> vertex value type
    + * @param <EV> edge value type
    + */
    +public class JaccardIndex<K extends CopyableValue<K>, VV, EV>
    +implements GraphAlgorithm<K, VV, EV, DataSet<Result<K>>> {
    +
    +	public static final int DEFAULT_GROUP_SIZE = 64;
    +
    +	// Optional configuration
    +	private int groupSize = DEFAULT_GROUP_SIZE;
    +
    +	private boolean unboundedScores = true;
    +
    +	private int minimumScoreNumerator = 0;
    +
    +	private int minimumScoreDenominator = 1;
    +
    +	private int maximumScoreNumerator = 1;
    +
    +	private int maximumScoreDenominator = 0;
    +
    +	private int littleParallelism = ExecutionConfig.PARALLELISM_UNKNOWN;
    +
    +	/**
    +	 * Override the default group size for the quadratic expansion of neighbor
    +	 * pairs. Small groups generate more data whereas large groups distribute
    +	 * computation less evenly among tasks.
    +	 *
    +	 * @param groupSize the group size for the quadratic expansion of neighbor pairs
    +	 * @return this
    +	 */
    +	public JaccardIndex<K, VV, EV> setGroupSize(int groupSize) {
    +		Preconditions.checkArgument(groupSize > 0, "Group size must be greater than zero");
    +
    +		this.groupSize = groupSize;
    +
    +		return this;
    +	}
    +
    +	/**
    +	 * Filter out Jaccard Index scores less than the given minimum fraction.
    +	 *
    +	 * @param numerator numerator of the minimum score
    +	 * @param denominator denominator of the minimum score
    +	 * @return this
    +	 * @see #setMaximumScore(int, int)
    +	 */
    +	public JaccardIndex<K, VV, EV> setMinimumScore(int numerator, int denominator) {
    --- End diff --
    
    This should be documented in the library usage docs too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63893365
  
    --- Diff: docs/apis/batch/libs/gelly.md ---
    @@ -2051,6 +2052,26 @@ The algorithm takes a directed, vertex (and possibly edge) attributed graph as i
     vertex represents a group of vertices and each edge represents a group of edges from the input graph. Furthermore, each
     vertex and edge in the output graph stores the common group value and the number of represented elements.
     
    +### Jaccard Index
    +
    +#### Overview
    +The Jaccard Index measures the similarity between vertex neighborhoods. Scores range from 0.0 (no common neighbors) to
    +1.0 (all neighbors are common).
    +
    +#### Details
    +Counting common neighbors for pairs of vertices is equivalent to counting the two-paths consisting of two edges
    +connecting the two vertices to the common neighbor. The number of distinct neighbors for pairs of vertices is computed
    +by storing the sum of degrees of the vertex pair and subtracting the count of common neighbors, which are double-counted
    +in the sum of degrees.
    +
    +The algorithm first annotates each edge with the endpoint degree. Grouping on the midpoint vertex, each pair of
    +neighbors is emitted with the endpoint degree sum. Grouping on two-paths, the common neighbors are counted.
    +
    +#### Usage
    +The algorithm takes a simple, undirected graph as input and outputs a `DataSet` of tuples containing two vertex IDs,
    +the number of common neighbors, and the number of distinct neighbors. The graph ID type must be `Comparable` and
    --- End diff --
    
    It does, from `Result.getJaccardIndexScore()`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63893564
  
    --- Diff: docs/apis/batch/libs/gelly.md ---
    @@ -2051,6 +2052,26 @@ The algorithm takes a directed, vertex (and possibly edge) attributed graph as i
     vertex represents a group of vertices and each edge represents a group of edges from the input graph. Furthermore, each
     vertex and edge in the output graph stores the common group value and the number of represented elements.
     
    +### Jaccard Index
    +
    +#### Overview
    +The Jaccard Index measures the similarity between vertex neighborhoods. Scores range from 0.0 (no common neighbors) to
    +1.0 (all neighbors are common).
    +
    +#### Details
    +Counting common neighbors for pairs of vertices is equivalent to counting the two-paths consisting of two edges
    +connecting the two vertices to the common neighbor. The number of distinct neighbors for pairs of vertices is computed
    +by storing the sum of degrees of the vertex pair and subtracting the count of common neighbors, which are double-counted
    +in the sum of degrees.
    +
    +The algorithm first annotates each edge with the endpoint degree. Grouping on the midpoint vertex, each pair of
    +neighbors is emitted with the endpoint degree sum. Grouping on two-paths, the common neighbors are counted.
    +
    +#### Usage
    +The algorithm takes a simple, undirected graph as input and outputs a `DataSet` of tuples containing two vertex IDs,
    +the number of common neighbors, and the number of distinct neighbors. The graph ID type must be `Comparable` and
    --- End diff --
    
    Ah great! Can you add this here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63894069
  
    --- Diff: flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/similarity/JaccardIndex.java ---
    @@ -0,0 +1,462 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library.similarity;
    +
    +import org.apache.flink.api.common.ExecutionConfig;
    +import org.apache.flink.api.common.functions.FlatMapFunction;
    +import org.apache.flink.api.common.functions.GroupReduceFunction;
    +import org.apache.flink.api.common.operators.Order;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.functions.FunctionAnnotation;
    +import org.apache.flink.api.java.tuple.Tuple2;
    +import org.apache.flink.api.java.tuple.Tuple3;
    +import org.apache.flink.api.java.tuple.Tuple4;
    +import org.apache.flink.graph.Edge;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.GraphAlgorithm;
    +import org.apache.flink.graph.asm.degree.annotate.undirected.EdgeTargetDegree;
    +import org.apache.flink.graph.library.similarity.JaccardIndex.Result;
    +import org.apache.flink.graph.utils.Murmur3_32;
    +import org.apache.flink.types.CopyableValue;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.util.Collector;
    +import org.apache.flink.util.Preconditions;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +/**
    + * The Jaccard Index measures the similarity between vertex neighborhoods.
    + * Scores range from 0.0 (no common neighbors) to 1.0 (all neighbors are common).
    + * <br/>
    + * This implementation produces similarity scores for each pair of vertices
    + * in the graph with at least one common neighbor; equivalently, this is the
    + * set of all non-zero Jaccard Similarity coefficients.
    + * <br/>
    + * The input graph must be a simple, undirected graph containing no duplicate
    + * edges or self-loops.
    + *
    + * @param <K> graph ID type
    + * @param <VV> vertex value type
    + * @param <EV> edge value type
    + */
    +public class JaccardIndex<K extends CopyableValue<K>, VV, EV>
    +implements GraphAlgorithm<K, VV, EV, DataSet<Result<K>>> {
    +
    +	public static final int DEFAULT_GROUP_SIZE = 64;
    +
    +	// Optional configuration
    +	private int groupSize = DEFAULT_GROUP_SIZE;
    +
    +	private boolean unboundedScores = true;
    +
    +	private int minimumScoreNumerator = 0;
    +
    +	private int minimumScoreDenominator = 1;
    +
    +	private int maximumScoreNumerator = 1;
    +
    +	private int maximumScoreDenominator = 0;
    +
    +	private int littleParallelism = ExecutionConfig.PARALLELISM_UNKNOWN;
    +
    +	/**
    +	 * Override the default group size for the quadratic expansion of neighbor
    +	 * pairs. Small groups generate more data whereas large groups distribute
    +	 * computation less evenly among tasks.
    +	 *
    +	 * @param groupSize the group size for the quadratic expansion of neighbor pairs
    +	 * @return this
    +	 */
    +	public JaccardIndex<K, VV, EV> setGroupSize(int groupSize) {
    --- End diff --
    
    Could you maybe extend the description of what groupSize does? And it would be nice to add documentation for this method in the library docs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the pull request:

    https://github.com/apache/flink/pull/1980#issuecomment-220116075
  
    Hi @vasia that would be great if you have time after 2044.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63893177
  
    --- Diff: flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/asm/degree/annotate/TranslateEdgeDegreeToIntValue.java ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.asm.degree.annotate;
    +
    +import org.apache.flink.api.java.tuple.Tuple2;
    +import org.apache.flink.graph.asm.translate.TranslateFunction;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +
    +/**
    + * Translate the edge degree returned by the degree annotation functions from
    + * {@link LongValue} to {@link IntValue}.
    + *
    + * @param <T> edge value type
    + */
    +public class TranslateEdgeDegreeToIntValue<T>
    --- End diff --
    
    This is a general translator from `LongValue` to `IntValue`, not just for edge degrees, right? Maybe we could rename it to demonstrate that?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63899403
  
    --- Diff: flink-libraries/flink-gelly/src/test/java/org/apache/flink/graph/library/similarity/JaccardIndexTest.java ---
    @@ -0,0 +1,136 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library.similarity;
    +
    +import org.apache.commons.math3.random.JDKRandomGenerator;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.Utils.ChecksumHashCode;
    +import org.apache.flink.api.java.utils.DataSetUtils;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.asm.AsmTestBase;
    +import org.apache.flink.graph.generator.RMatGraph;
    +import org.apache.flink.graph.generator.random.JDKRandomGeneratorFactory;
    +import org.apache.flink.graph.generator.random.RandomGenerableFactory;
    +import org.apache.flink.graph.library.similarity.JaccardIndex.Result;
    +import org.apache.flink.test.util.TestBaseUtils;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.types.NullValue;
    +import org.junit.Test;
    +
    +import static org.junit.Assert.assertEquals;
    +
    +public class JaccardIndexTest
    +extends AsmTestBase {
    +
    +	@Test
    +	public void testSimpleGraph()
    +			throws Exception {
    +		DataSet<Result<IntValue>> ji = undirectedSimpleGraph
    +			.run(new JaccardIndex<IntValue, NullValue, NullValue>());
    +
    +		String expectedResult =
    +			"(0,1,(1,4))\n" +
    +			"(0,2,(1,4))\n" +
    +			"(0,3,(2,4))\n" +
    +			"(1,2,(2,4))\n" +
    +			"(1,3,(1,6))\n" +
    +			"(1,4,(1,3))\n" +
    +			"(1,5,(1,3))\n" +
    +			"(2,3,(1,6))\n" +
    +			"(2,4,(1,3))\n" +
    +			"(2,5,(1,3))\n" +
    +			"(4,5,(1,1))\n";
    +
    +		TestBaseUtils.compareResultAsText(ji.collect(), expectedResult);
    +	}
    +
    +	@Test
    +	public void testSimpleGraphWithMinimumScore()
    +			throws Exception {
    +		DataSet<Result<IntValue>> ji = undirectedSimpleGraph
    +			.run(new JaccardIndex<IntValue, NullValue, NullValue>()
    +				.setMinimumScore(1, 2));
    +
    +		String expectedResult =
    +			"(0,3,(2,4))\n" +
    +			"(1,2,(2,4))\n" +
    +			"(4,5,(1,1))\n";
    +
    +		TestBaseUtils.compareResultAsText(ji.collect(), expectedResult);
    +	}
    +
    +	@Test
    +	public void testSimpleGraphWithMaximumScore()
    +			throws Exception {
    +		DataSet<Result<IntValue>> ji = undirectedSimpleGraph
    +			.run(new JaccardIndex<IntValue, NullValue, NullValue>()
    +				.setMaximumScore(1, 2));
    +
    +		String expectedResult =
    +			"(0,1,(1,4))\n" +
    +			"(0,2,(1,4))\n" +
    +			"(1,3,(1,6))\n" +
    +			"(1,4,(1,3))\n" +
    +			"(1,5,(1,3))\n" +
    +			"(2,3,(1,6))\n" +
    +			"(2,4,(1,3))\n" +
    +			"(2,5,(1,3))\n";
    +
    +		TestBaseUtils.compareResultAsText(ji.collect(), expectedResult);
    +	}
    +
    +	@Test
    +	public void testCompleteGraph()
    +			throws Exception {
    +		DataSet<Result<LongValue>> ji = completeGraph
    +			.run(new JaccardIndex<LongValue, NullValue, NullValue>()
    +				.setGroupSize(4));
    +
    +		for (Result<LongValue> result : ji.collect()) {
    +			// the intersection includes every vertex
    +			assertEquals(completeGraphVertexCount, result.getDistinctNeighborCount().getValue());
    +
    +			// the union only excludes the two vertices from the similarity score
    +			assertEquals(completeGraphVertexCount - 2, result.getCommonNeighborCount().getValue());
    +		}
    +	}
    +
    +	@Test
    +	public void testRMatGraph()
    +			throws Exception {
    +		long vertexCount = 1 << 8;
    +		long edgeCount = 8 * vertexCount;
    +
    +		RandomGenerableFactory<JDKRandomGenerator> rnd = new JDKRandomGeneratorFactory();
    +
    +		Graph<LongValue, NullValue, NullValue> graph = new RMatGraph<>(env, rnd, vertexCount, edgeCount)
    +			.setSimpleGraph(true, false)
    +			.generate();
    +
    +		DataSet<Result<LongValue>> ji = graph
    +			.run(new JaccardIndex<LongValue, NullValue, NullValue>()
    +				.setGroupSize(4));
    +
    +		ChecksumHashCode checksum = DataSetUtils.checksumHashCode(ji);
    +
    +		assertEquals(13954, checksum.getCount());
    +		assertEquals(0x0000179f83a2a873L, checksum.getChecksum());
    --- End diff --
    
    Then, what does you test here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the pull request:

    https://github.com/apache/flink/pull/1980#issuecomment-220600881
  
    @vasia all updates should be in place.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63898835
  
    --- Diff: flink-libraries/flink-gelly/src/test/java/org/apache/flink/graph/library/similarity/JaccardIndexTest.java ---
    @@ -0,0 +1,136 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library.similarity;
    +
    +import org.apache.commons.math3.random.JDKRandomGenerator;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.Utils.ChecksumHashCode;
    +import org.apache.flink.api.java.utils.DataSetUtils;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.asm.AsmTestBase;
    +import org.apache.flink.graph.generator.RMatGraph;
    +import org.apache.flink.graph.generator.random.JDKRandomGeneratorFactory;
    +import org.apache.flink.graph.generator.random.RandomGenerableFactory;
    +import org.apache.flink.graph.library.similarity.JaccardIndex.Result;
    +import org.apache.flink.test.util.TestBaseUtils;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.types.NullValue;
    +import org.junit.Test;
    +
    +import static org.junit.Assert.assertEquals;
    +
    +public class JaccardIndexTest
    +extends AsmTestBase {
    +
    +	@Test
    +	public void testSimpleGraph()
    +			throws Exception {
    +		DataSet<Result<IntValue>> ji = undirectedSimpleGraph
    +			.run(new JaccardIndex<IntValue, NullValue, NullValue>());
    +
    +		String expectedResult =
    +			"(0,1,(1,4))\n" +
    +			"(0,2,(1,4))\n" +
    +			"(0,3,(2,4))\n" +
    +			"(1,2,(2,4))\n" +
    +			"(1,3,(1,6))\n" +
    +			"(1,4,(1,3))\n" +
    +			"(1,5,(1,3))\n" +
    +			"(2,3,(1,6))\n" +
    +			"(2,4,(1,3))\n" +
    +			"(2,5,(1,3))\n" +
    +			"(4,5,(1,1))\n";
    +
    +		TestBaseUtils.compareResultAsText(ji.collect(), expectedResult);
    +	}
    +
    +	@Test
    +	public void testSimpleGraphWithMinimumScore()
    +			throws Exception {
    +		DataSet<Result<IntValue>> ji = undirectedSimpleGraph
    +			.run(new JaccardIndex<IntValue, NullValue, NullValue>()
    +				.setMinimumScore(1, 2));
    +
    +		String expectedResult =
    +			"(0,3,(2,4))\n" +
    +			"(1,2,(2,4))\n" +
    +			"(4,5,(1,1))\n";
    +
    +		TestBaseUtils.compareResultAsText(ji.collect(), expectedResult);
    +	}
    +
    +	@Test
    +	public void testSimpleGraphWithMaximumScore()
    +			throws Exception {
    +		DataSet<Result<IntValue>> ji = undirectedSimpleGraph
    +			.run(new JaccardIndex<IntValue, NullValue, NullValue>()
    +				.setMaximumScore(1, 2));
    +
    +		String expectedResult =
    +			"(0,1,(1,4))\n" +
    +			"(0,2,(1,4))\n" +
    +			"(1,3,(1,6))\n" +
    +			"(1,4,(1,3))\n" +
    +			"(1,5,(1,3))\n" +
    +			"(2,3,(1,6))\n" +
    +			"(2,4,(1,3))\n" +
    +			"(2,5,(1,3))\n";
    +
    +		TestBaseUtils.compareResultAsText(ji.collect(), expectedResult);
    +	}
    +
    +	@Test
    +	public void testCompleteGraph()
    +			throws Exception {
    +		DataSet<Result<LongValue>> ji = completeGraph
    +			.run(new JaccardIndex<LongValue, NullValue, NullValue>()
    +				.setGroupSize(4));
    +
    +		for (Result<LongValue> result : ji.collect()) {
    +			// the intersection includes every vertex
    +			assertEquals(completeGraphVertexCount, result.getDistinctNeighborCount().getValue());
    +
    +			// the union only excludes the two vertices from the similarity score
    +			assertEquals(completeGraphVertexCount - 2, result.getCommonNeighborCount().getValue());
    +		}
    +	}
    +
    +	@Test
    +	public void testRMatGraph()
    +			throws Exception {
    +		long vertexCount = 1 << 8;
    +		long edgeCount = 8 * vertexCount;
    +
    +		RandomGenerableFactory<JDKRandomGenerator> rnd = new JDKRandomGeneratorFactory();
    +
    +		Graph<LongValue, NullValue, NullValue> graph = new RMatGraph<>(env, rnd, vertexCount, edgeCount)
    +			.setSimpleGraph(true, false)
    +			.generate();
    +
    +		DataSet<Result<LongValue>> ji = graph
    +			.run(new JaccardIndex<LongValue, NullValue, NullValue>()
    +				.setGroupSize(4));
    +
    +		ChecksumHashCode checksum = DataSetUtils.checksumHashCode(ji);
    +
    +		assertEquals(13954, checksum.getCount());
    +		assertEquals(0x0000179f83a2a873L, checksum.getChecksum());
    --- End diff --
    
    By running the algorithm.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r64047578
  
    --- Diff: docs/apis/batch/libs/gelly.md ---
    @@ -2055,22 +2055,22 @@ vertex and edge in the output graph stores the common group value and the number
     ### Jaccard Index
     
     #### Overview
    -The Jaccard Index measures the similarity between vertex neighborhoods. Scores range from 0.0 (no common neighbors) to
    -1.0 (all neighbors are common).
    +The Jaccard Index measures the similarity between vertex neighborhoods and is computed as the number of shared numbers
    --- End diff --
    
    shared numbers -> shared *neighbors?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63951105
  
    --- Diff: flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/JaccardIndex.java ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.examples;
    +
    +import org.apache.commons.math3.random.JDKRandomGenerator;
    +import org.apache.flink.api.common.JobExecutionResult;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.ExecutionEnvironment;
    +import org.apache.flink.api.java.io.CsvOutputFormat;
    +import org.apache.flink.api.java.utils.DataSetUtils;
    +import org.apache.flink.api.java.utils.ParameterTool;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.asm.translate.LongValueToIntValue;
    +import org.apache.flink.graph.asm.translate.TranslateGraphIds;
    +import org.apache.flink.graph.generator.RMatGraph;
    +import org.apache.flink.graph.generator.random.JDKRandomGeneratorFactory;
    +import org.apache.flink.graph.generator.random.RandomGenerableFactory;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.types.NullValue;
    +
    +import java.text.NumberFormat;
    +
    +/**
    + * Driver for the library implementation of Jaccard Index.
    + *
    + * This example generates an undirected RMat graph with the given scale and
    + * edge factor then calculates all non-zero Jaccard Index similarity scores
    + * between vertices.
    + *
    + * @see org.apache.flink.graph.library.similarity.JaccardIndex
    + */
    +public class JaccardIndex {
    +
    +	public static final int DEFAULT_SCALE = 10;
    +
    +	public static final int DEFAULT_EDGE_FACTOR = 16;
    +
    +	public static final boolean DEFAULT_CLIP_AND_FLIP = true;
    +
    +	public static void main(String[] args) throws Exception {
    +		// Set up the execution environment
    +		final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    +		env.getConfig().enableObjectReuse();
    +
    +		ParameterTool parameters = ParameterTool.fromArgs(args);
    +
    +		// Generate RMat graph
    +		int scale = parameters.getInt("scale", DEFAULT_SCALE);
    +		int edgeFactor = parameters.getInt("edge_factor", DEFAULT_EDGE_FACTOR);
    +
    +		RandomGenerableFactory<JDKRandomGenerator> rnd = new JDKRandomGeneratorFactory();
    +
    +		long vertexCount = 1L << scale;
    +		long edgeCount = vertexCount * edgeFactor;
    +
    +		boolean clipAndFlip = parameters.getBoolean("clip_and_flip", DEFAULT_CLIP_AND_FLIP);
    +
    +		Graph<LongValue, NullValue, NullValue> graph = new RMatGraph<>(env, rnd, vertexCount, edgeCount)
    +			.setSimpleGraph(true, clipAndFlip)
    +			.generate();
    +
    +		DataSet js;
    +
    +		if (scale > 32) {
    +			js = graph
    +				.run(new org.apache.flink.graph.library.similarity.JaccardIndex<LongValue, NullValue, NullValue>());
    +		} else {
    +			js = graph
    +				.run(new TranslateGraphIds<LongValue, IntValue, NullValue, NullValue>(new LongValueToIntValue()))
    +				.run(new org.apache.flink.graph.library.similarity.JaccardIndex<IntValue, NullValue, NullValue>());
    +		}
    +
    +		switch (parameters.get("output", "")) {
    --- End diff --
    
    Is the usage statement sufficient?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r64048412
  
    --- Diff: flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/JaccardIndex.java ---
    @@ -40,9 +43,9 @@
     /**
      * Driver for the library implementation of Jaccard Index.
      *
    - * This example generates an undirected RMat graph with the given scale and
    - * edge factor then calculates all non-zero Jaccard Index similarity scores
    - * between vertices.
    + * This example reads a simple, undirected graph from a CSV file or generates
    --- End diff --
    
    remove one "generates"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r64048334
  
    --- Diff: docs/apis/batch/libs/gelly.md ---
    @@ -2250,14 +2250,33 @@ graph.run(new TranslateVertexValues(new LongValueAddOffset(vertexCount)));
         </tr>
     
         <tr>
    -      <td>translate.<br/><strong>TranslateEdgeValues</strong></td>
    +      <td>asm.translate.<br/><strong>TranslateEdgeValues</strong></td>
           <td>
             <p>Translate edge values using the given <code>TranslateFunction</code>.</p>
     {% highlight java %}
     graph.run(new TranslateEdgeValues(new Nullify()));
     {% endhighlight %}
           </td>
         </tr>
    +
    +    <tr>
    +      <td>library.similarity.<br/><strong>JaccardIndex</strong></td>
    +      <td>
    +        <p>Measures the similarity between vertex neighborhoods. The Jaccard Index score  is computed as the number of shared numbers divided by the number of distinct neighbors. Scores range from 0.0 (no shared neighbors) to 1.0 (all neighbors are shared).</p>
    --- End diff --
    
    Why did you add this here and not in the "Usage" section of the library method?
    I find it a bit confusing... You describe graph algorithms as building blocks for other algorithms. Does Jaccard index fall in this category?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63890500
  
    --- Diff: docs/apis/batch/libs/gelly.md ---
    @@ -2051,6 +2052,26 @@ The algorithm takes a directed, vertex (and possibly edge) attributed graph as i
     vertex represents a group of vertices and each edge represents a group of edges from the input graph. Furthermore, each
     vertex and edge in the output graph stores the common group value and the number of represented elements.
     
    +### Jaccard Index
    +
    +#### Overview
    +The Jaccard Index measures the similarity between vertex neighborhoods. Scores range from 0.0 (no common neighbors) to
    +1.0 (all neighbors are common).
    +
    +#### Details
    +Counting common neighbors for pairs of vertices is equivalent to counting the two-paths consisting of two edges
    --- End diff --
    
    By "two-paths" you mean triads? i.e. open triangles?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63892449
  
    --- Diff: flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/JaccardIndex.java ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.examples;
    +
    +import org.apache.commons.math3.random.JDKRandomGenerator;
    +import org.apache.flink.api.common.JobExecutionResult;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.ExecutionEnvironment;
    +import org.apache.flink.api.java.io.CsvOutputFormat;
    +import org.apache.flink.api.java.utils.DataSetUtils;
    +import org.apache.flink.api.java.utils.ParameterTool;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.asm.translate.LongValueToIntValue;
    +import org.apache.flink.graph.asm.translate.TranslateGraphIds;
    +import org.apache.flink.graph.generator.RMatGraph;
    +import org.apache.flink.graph.generator.random.JDKRandomGeneratorFactory;
    +import org.apache.flink.graph.generator.random.RandomGenerableFactory;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.types.NullValue;
    +
    +import java.text.NumberFormat;
    +
    +/**
    + * Driver for the library implementation of Jaccard Index.
    + *
    + * This example generates an undirected RMat graph with the given scale and
    + * edge factor then calculates all non-zero Jaccard Index similarity scores
    + * between vertices.
    + *
    + * @see org.apache.flink.graph.library.similarity.JaccardIndex
    + */
    +public class JaccardIndex {
    +
    +	public static final int DEFAULT_SCALE = 10;
    +
    +	public static final int DEFAULT_EDGE_FACTOR = 16;
    +
    +	public static final boolean DEFAULT_CLIP_AND_FLIP = true;
    +
    +	public static void main(String[] args) throws Exception {
    +		// Set up the execution environment
    +		final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    +		env.getConfig().enableObjectReuse();
    +
    +		ParameterTool parameters = ParameterTool.fromArgs(args);
    +
    +		// Generate RMat graph
    +		int scale = parameters.getInt("scale", DEFAULT_SCALE);
    +		int edgeFactor = parameters.getInt("edge_factor", DEFAULT_EDGE_FACTOR);
    +
    +		RandomGenerableFactory<JDKRandomGenerator> rnd = new JDKRandomGeneratorFactory();
    +
    +		long vertexCount = 1L << scale;
    +		long edgeCount = vertexCount * edgeFactor;
    +
    +		boolean clipAndFlip = parameters.getBoolean("clip_and_flip", DEFAULT_CLIP_AND_FLIP);
    +
    +		Graph<LongValue, NullValue, NullValue> graph = new RMatGraph<>(env, rnd, vertexCount, edgeCount)
    +			.setSimpleGraph(true, clipAndFlip)
    +			.generate();
    +
    +		DataSet js;
    +
    +		if (scale > 32) {
    +			js = graph
    +				.run(new org.apache.flink.graph.library.similarity.JaccardIndex<LongValue, NullValue, NullValue>());
    +		} else {
    +			js = graph
    +				.run(new TranslateGraphIds<LongValue, IntValue, NullValue, NullValue>(new LongValueToIntValue()))
    +				.run(new org.apache.flink.graph.library.similarity.JaccardIndex<IntValue, NullValue, NullValue>());
    +		}
    +
    +		switch (parameters.get("output", "")) {
    --- End diff --
    
    Can you add a description for these output parameter options in the class Javadoc?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the pull request:

    https://github.com/apache/flink/pull/1980#issuecomment-220122997
  
    I figured. I forgot to mention that I was looking to reference the JaccardIndex self-join in the discussion of FLINK-3910 but I can link into the PR this time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63892061
  
    --- Diff: flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/JaccardIndex.java ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.examples;
    +
    +import org.apache.commons.math3.random.JDKRandomGenerator;
    +import org.apache.flink.api.common.JobExecutionResult;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.ExecutionEnvironment;
    +import org.apache.flink.api.java.io.CsvOutputFormat;
    +import org.apache.flink.api.java.utils.DataSetUtils;
    +import org.apache.flink.api.java.utils.ParameterTool;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.asm.translate.LongValueToIntValue;
    +import org.apache.flink.graph.asm.translate.TranslateGraphIds;
    +import org.apache.flink.graph.generator.RMatGraph;
    +import org.apache.flink.graph.generator.random.JDKRandomGeneratorFactory;
    +import org.apache.flink.graph.generator.random.RandomGenerableFactory;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.types.NullValue;
    +
    +import java.text.NumberFormat;
    +
    +/**
    + * Driver for the library implementation of Jaccard Index.
    + *
    + * This example generates an undirected RMat graph with the given scale and
    --- End diff --
    
    The rest of the examples in Gelly (and all other Flink components) are designed so that they can be run both with user-specified inputs and without (default data). Could you please modify this example to work in a similar way?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r64048499
  
    --- Diff: flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/JaccardIndex.java ---
    @@ -54,73 +57,112 @@
     
     	public static final boolean DEFAULT_CLIP_AND_FLIP = true;
     
    +	private static void printUsage() {
    +		System.out.println(WordUtils.wrap("The Jaccard Index measures the similarity between vertex" +
    +			" neighborhoods and is computed as the number of shared numbers divided by the number of" +
    --- End diff --
    
    numbers -> neighbors


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the pull request:

    https://github.com/apache/flink/pull/1980#issuecomment-220088196
  
    Tests are passing ... will merge ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63890630
  
    --- Diff: docs/apis/batch/libs/gelly.md ---
    @@ -2051,6 +2052,26 @@ The algorithm takes a directed, vertex (and possibly edge) attributed graph as i
     vertex represents a group of vertices and each edge represents a group of edges from the input graph. Furthermore, each
     vertex and edge in the output graph stores the common group value and the number of represented elements.
     
    +### Jaccard Index
    +
    +#### Overview
    +The Jaccard Index measures the similarity between vertex neighborhoods. Scores range from 0.0 (no common neighbors) to
    +1.0 (all neighbors are common).
    +
    +#### Details
    +Counting common neighbors for pairs of vertices is equivalent to counting the two-paths consisting of two edges
    +connecting the two vertices to the common neighbor. The number of distinct neighbors for pairs of vertices is computed
    --- End diff --
    
    distinct _common_ neighbors?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the pull request:

    https://github.com/apache/flink/pull/1980#issuecomment-220658718
  
    Thanks for the reviews @vasia! Merging ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63892035
  
    --- Diff: docs/apis/batch/libs/gelly.md ---
    @@ -2051,6 +2052,26 @@ The algorithm takes a directed, vertex (and possibly edge) attributed graph as i
     vertex represents a group of vertices and each edge represents a group of edges from the input graph. Furthermore, each
     vertex and edge in the output graph stores the common group value and the number of represented elements.
     
    +### Jaccard Index
    +
    +#### Overview
    +The Jaccard Index measures the similarity between vertex neighborhoods. Scores range from 0.0 (no common neighbors) to
    +1.0 (all neighbors are common).
    +
    +#### Details
    +Counting common neighbors for pairs of vertices is equivalent to counting the two-paths consisting of two edges
    --- End diff --
    
    I've also seen triad to mean any three vertices and triplet to mean any three connected vertices. I'll rework this section.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63894471
  
    --- Diff: flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/similarity/JaccardIndex.java ---
    @@ -0,0 +1,462 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library.similarity;
    +
    +import org.apache.flink.api.common.ExecutionConfig;
    +import org.apache.flink.api.common.functions.FlatMapFunction;
    +import org.apache.flink.api.common.functions.GroupReduceFunction;
    +import org.apache.flink.api.common.operators.Order;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.functions.FunctionAnnotation;
    +import org.apache.flink.api.java.tuple.Tuple2;
    +import org.apache.flink.api.java.tuple.Tuple3;
    +import org.apache.flink.api.java.tuple.Tuple4;
    +import org.apache.flink.graph.Edge;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.GraphAlgorithm;
    +import org.apache.flink.graph.asm.degree.annotate.undirected.EdgeTargetDegree;
    +import org.apache.flink.graph.library.similarity.JaccardIndex.Result;
    +import org.apache.flink.graph.utils.Murmur3_32;
    +import org.apache.flink.types.CopyableValue;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.util.Collector;
    +import org.apache.flink.util.Preconditions;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +/**
    + * The Jaccard Index measures the similarity between vertex neighborhoods.
    + * Scores range from 0.0 (no common neighbors) to 1.0 (all neighbors are common).
    + * <br/>
    + * This implementation produces similarity scores for each pair of vertices
    + * in the graph with at least one common neighbor; equivalently, this is the
    + * set of all non-zero Jaccard Similarity coefficients.
    + * <br/>
    + * The input graph must be a simple, undirected graph containing no duplicate
    + * edges or self-loops.
    + *
    + * @param <K> graph ID type
    + * @param <VV> vertex value type
    + * @param <EV> edge value type
    + */
    +public class JaccardIndex<K extends CopyableValue<K>, VV, EV>
    +implements GraphAlgorithm<K, VV, EV, DataSet<Result<K>>> {
    +
    +	public static final int DEFAULT_GROUP_SIZE = 64;
    +
    +	// Optional configuration
    +	private int groupSize = DEFAULT_GROUP_SIZE;
    +
    +	private boolean unboundedScores = true;
    +
    +	private int minimumScoreNumerator = 0;
    +
    +	private int minimumScoreDenominator = 1;
    +
    +	private int maximumScoreNumerator = 1;
    +
    +	private int maximumScoreDenominator = 0;
    +
    +	private int littleParallelism = ExecutionConfig.PARALLELISM_UNKNOWN;
    +
    +	/**
    +	 * Override the default group size for the quadratic expansion of neighbor
    +	 * pairs. Small groups generate more data whereas large groups distribute
    +	 * computation less evenly among tasks.
    +	 *
    +	 * @param groupSize the group size for the quadratic expansion of neighbor pairs
    +	 * @return this
    +	 */
    +	public JaccardIndex<K, VV, EV> setGroupSize(int groupSize) {
    +		Preconditions.checkArgument(groupSize > 0, "Group size must be greater than zero");
    +
    +		this.groupSize = groupSize;
    +
    +		return this;
    +	}
    +
    +	/**
    +	 * Filter out Jaccard Index scores less than the given minimum fraction.
    +	 *
    +	 * @param numerator numerator of the minimum score
    +	 * @param denominator denominator of the minimum score
    +	 * @return this
    +	 * @see #setMaximumScore(int, int)
    +	 */
    +	public JaccardIndex<K, VV, EV> setMinimumScore(int numerator, int denominator) {
    +		Preconditions.checkArgument(numerator >= 0, "Minimum score numerator must be non-negative");
    +		Preconditions.checkArgument(denominator > 0, "Minimum score denominator must be greater than zero");
    +		Preconditions.checkArgument(numerator <= denominator, "Minimum score fraction must be less than or equal to one");
    +
    +		this.unboundedScores = false;
    +		this.minimumScoreNumerator = numerator;
    +		this.minimumScoreDenominator = denominator;
    +
    +		return this;
    +	}
    +
    +	/**
    +	 * Filter out Jaccard Index scores greater than or equal to the given maximum fraction.
    +	 *
    +	 * @param numerator numerator of the maximum score
    +	 * @param denominator denominator of the maximum score
    +	 * @return this
    +	 * @see #setMinimumScore(int, int)
    +	 */
    +	public JaccardIndex<K, VV, EV> setMaximumScore(int numerator, int denominator) {
    +		Preconditions.checkArgument(numerator >= 0, "Maximum score numerator must be non-negative");
    +		Preconditions.checkArgument(denominator > 0, "Maximum score denominator must be greater than zero");
    +		Preconditions.checkArgument(numerator <= denominator, "Maximum score fraction must be less than or equal to one");
    +
    +		this.unboundedScores = false;
    +		this.maximumScoreNumerator = numerator;
    +		this.maximumScoreDenominator = denominator;
    +
    +		return this;
    +	}
    +
    +	/**
    +	 * Override the parallelism of operators processing small amounts of data.
    +	 *
    +	 * @param littleParallelism operator parallelism
    +	 * @return this
    +	 */
    +	public JaccardIndex<K, VV, EV> setLittleParallelism(int littleParallelism) {
    +		this.littleParallelism = littleParallelism;
    +
    +		return this;
    +	}
    +
    +	/*
    +	 * Implementation notes:
    +	 *
    +	 * The requirement that "K extends CopyableValue<K>" can be removed when
    +	 *   Flink has a self-join which performs the skew distribution handled by
    +	 *   GenerateGroupSpans / GenerateGroups / GenerateGroupPairs.
    +	 */
    +
    +	@Override
    +	public DataSet<Result<K>> run(Graph<K, VV, EV> input)
    --- End diff --
    
    Can you please add Javadoc for the `Result` type?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63891097
  
    --- Diff: docs/apis/batch/libs/gelly.md ---
    @@ -2051,6 +2052,26 @@ The algorithm takes a directed, vertex (and possibly edge) attributed graph as i
     vertex represents a group of vertices and each edge represents a group of edges from the input graph. Furthermore, each
     vertex and edge in the output graph stores the common group value and the number of represented elements.
     
    +### Jaccard Index
    +
    +#### Overview
    +The Jaccard Index measures the similarity between vertex neighborhoods. Scores range from 0.0 (no common neighbors) to
    +1.0 (all neighbors are common).
    +
    +#### Details
    +Counting common neighbors for pairs of vertices is equivalent to counting the two-paths consisting of two edges
    +connecting the two vertices to the common neighbor. The number of distinct neighbors for pairs of vertices is computed
    +by storing the sum of degrees of the vertex pair and subtracting the count of common neighbors, which are double-counted
    +in the sum of degrees.
    +
    +The algorithm first annotates each edge with the endpoint degree. Grouping on the midpoint vertex, each pair of
    +neighbors is emitted with the endpoint degree sum. Grouping on two-paths, the common neighbors are counted.
    +
    +#### Usage
    +The algorithm takes a simple, undirected graph as input and outputs a `DataSet` of tuples containing two vertex IDs,
    +the number of common neighbors, and the number of distinct neighbors. The graph ID type must be `Comparable` and
    --- End diff --
    
    Why doesn't the algorithm return the jaccard index?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1980#issuecomment-220118204
  
    Great, that's my plan :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r64037567
  
    --- Diff: flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/asm/degree/annotate/TranslateEdgeDegreeToIntValue.java ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.asm.degree.annotate;
    +
    +import org.apache.flink.api.java.tuple.Tuple2;
    +import org.apache.flink.graph.asm.translate.TranslateFunction;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +
    +/**
    + * Translate the edge degree returned by the degree annotation functions from
    + * {@link LongValue} to {@link IntValue}.
    + *
    + * @param <T> edge value type
    + */
    +public class TranslateEdgeDegreeToIntValue<T>
    --- End diff --
    
    I thought on this for a while, and decided that the best option is to subsume the translation in the next operation, and discovered that I had already made that change. So the class just needs to be removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63901325
  
    --- Diff: flink-libraries/flink-gelly/src/test/java/org/apache/flink/graph/library/similarity/JaccardIndexTest.java ---
    @@ -0,0 +1,136 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library.similarity;
    +
    +import org.apache.commons.math3.random.JDKRandomGenerator;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.Utils.ChecksumHashCode;
    +import org.apache.flink.api.java.utils.DataSetUtils;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.asm.AsmTestBase;
    +import org.apache.flink.graph.generator.RMatGraph;
    +import org.apache.flink.graph.generator.random.JDKRandomGeneratorFactory;
    +import org.apache.flink.graph.generator.random.RandomGenerableFactory;
    +import org.apache.flink.graph.library.similarity.JaccardIndex.Result;
    +import org.apache.flink.test.util.TestBaseUtils;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.types.NullValue;
    +import org.junit.Test;
    +
    +import static org.junit.Assert.assertEquals;
    +
    +public class JaccardIndexTest
    +extends AsmTestBase {
    +
    +	@Test
    +	public void testSimpleGraph()
    +			throws Exception {
    +		DataSet<Result<IntValue>> ji = undirectedSimpleGraph
    +			.run(new JaccardIndex<IntValue, NullValue, NullValue>());
    +
    +		String expectedResult =
    +			"(0,1,(1,4))\n" +
    +			"(0,2,(1,4))\n" +
    +			"(0,3,(2,4))\n" +
    +			"(1,2,(2,4))\n" +
    +			"(1,3,(1,6))\n" +
    +			"(1,4,(1,3))\n" +
    +			"(1,5,(1,3))\n" +
    +			"(2,3,(1,6))\n" +
    +			"(2,4,(1,3))\n" +
    +			"(2,5,(1,3))\n" +
    +			"(4,5,(1,1))\n";
    +
    +		TestBaseUtils.compareResultAsText(ji.collect(), expectedResult);
    +	}
    +
    +	@Test
    +	public void testSimpleGraphWithMinimumScore()
    +			throws Exception {
    +		DataSet<Result<IntValue>> ji = undirectedSimpleGraph
    +			.run(new JaccardIndex<IntValue, NullValue, NullValue>()
    +				.setMinimumScore(1, 2));
    +
    +		String expectedResult =
    +			"(0,3,(2,4))\n" +
    +			"(1,2,(2,4))\n" +
    +			"(4,5,(1,1))\n";
    +
    +		TestBaseUtils.compareResultAsText(ji.collect(), expectedResult);
    +	}
    +
    +	@Test
    +	public void testSimpleGraphWithMaximumScore()
    +			throws Exception {
    +		DataSet<Result<IntValue>> ji = undirectedSimpleGraph
    +			.run(new JaccardIndex<IntValue, NullValue, NullValue>()
    +				.setMaximumScore(1, 2));
    +
    +		String expectedResult =
    +			"(0,1,(1,4))\n" +
    +			"(0,2,(1,4))\n" +
    +			"(1,3,(1,6))\n" +
    +			"(1,4,(1,3))\n" +
    +			"(1,5,(1,3))\n" +
    +			"(2,3,(1,6))\n" +
    +			"(2,4,(1,3))\n" +
    +			"(2,5,(1,3))\n";
    +
    +		TestBaseUtils.compareResultAsText(ji.collect(), expectedResult);
    +	}
    +
    +	@Test
    +	public void testCompleteGraph()
    +			throws Exception {
    +		DataSet<Result<LongValue>> ji = completeGraph
    +			.run(new JaccardIndex<LongValue, NullValue, NullValue>()
    +				.setGroupSize(4));
    +
    +		for (Result<LongValue> result : ji.collect()) {
    +			// the intersection includes every vertex
    +			assertEquals(completeGraphVertexCount, result.getDistinctNeighborCount().getValue());
    +
    +			// the union only excludes the two vertices from the similarity score
    +			assertEquals(completeGraphVertexCount - 2, result.getCommonNeighborCount().getValue());
    +		}
    +	}
    +
    +	@Test
    +	public void testRMatGraph()
    +			throws Exception {
    +		long vertexCount = 1 << 8;
    +		long edgeCount = 8 * vertexCount;
    +
    +		RandomGenerableFactory<JDKRandomGenerator> rnd = new JDKRandomGeneratorFactory();
    +
    +		Graph<LongValue, NullValue, NullValue> graph = new RMatGraph<>(env, rnd, vertexCount, edgeCount)
    +			.setSimpleGraph(true, false)
    +			.generate();
    +
    +		DataSet<Result<LongValue>> ji = graph
    +			.run(new JaccardIndex<LongValue, NullValue, NullValue>()
    +				.setGroupSize(4));
    +
    +		ChecksumHashCode checksum = DataSetUtils.checksumHashCode(ji);
    +
    +		assertEquals(13954, checksum.getCount());
    +		assertEquals(0x0000179f83a2a873L, checksum.getChecksum());
    --- End diff --
    
    These are regression tests and cover aspects of the algorithms not tested by hand-crafted inputs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63894348
  
    --- Diff: flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/similarity/JaccardIndex.java ---
    @@ -0,0 +1,462 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library.similarity;
    +
    +import org.apache.flink.api.common.ExecutionConfig;
    +import org.apache.flink.api.common.functions.FlatMapFunction;
    +import org.apache.flink.api.common.functions.GroupReduceFunction;
    +import org.apache.flink.api.common.operators.Order;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.functions.FunctionAnnotation;
    +import org.apache.flink.api.java.tuple.Tuple2;
    +import org.apache.flink.api.java.tuple.Tuple3;
    +import org.apache.flink.api.java.tuple.Tuple4;
    +import org.apache.flink.graph.Edge;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.GraphAlgorithm;
    +import org.apache.flink.graph.asm.degree.annotate.undirected.EdgeTargetDegree;
    +import org.apache.flink.graph.library.similarity.JaccardIndex.Result;
    +import org.apache.flink.graph.utils.Murmur3_32;
    +import org.apache.flink.types.CopyableValue;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.util.Collector;
    +import org.apache.flink.util.Preconditions;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +/**
    + * The Jaccard Index measures the similarity between vertex neighborhoods.
    + * Scores range from 0.0 (no common neighbors) to 1.0 (all neighbors are common).
    + * <br/>
    + * This implementation produces similarity scores for each pair of vertices
    + * in the graph with at least one common neighbor; equivalently, this is the
    + * set of all non-zero Jaccard Similarity coefficients.
    + * <br/>
    + * The input graph must be a simple, undirected graph containing no duplicate
    + * edges or self-loops.
    + *
    + * @param <K> graph ID type
    + * @param <VV> vertex value type
    + * @param <EV> edge value type
    + */
    +public class JaccardIndex<K extends CopyableValue<K>, VV, EV>
    +implements GraphAlgorithm<K, VV, EV, DataSet<Result<K>>> {
    +
    +	public static final int DEFAULT_GROUP_SIZE = 64;
    +
    +	// Optional configuration
    +	private int groupSize = DEFAULT_GROUP_SIZE;
    +
    +	private boolean unboundedScores = true;
    +
    +	private int minimumScoreNumerator = 0;
    +
    +	private int minimumScoreDenominator = 1;
    +
    +	private int maximumScoreNumerator = 1;
    +
    +	private int maximumScoreDenominator = 0;
    +
    +	private int littleParallelism = ExecutionConfig.PARALLELISM_UNKNOWN;
    +
    +	/**
    +	 * Override the default group size for the quadratic expansion of neighbor
    +	 * pairs. Small groups generate more data whereas large groups distribute
    +	 * computation less evenly among tasks.
    +	 *
    +	 * @param groupSize the group size for the quadratic expansion of neighbor pairs
    +	 * @return this
    +	 */
    +	public JaccardIndex<K, VV, EV> setGroupSize(int groupSize) {
    +		Preconditions.checkArgument(groupSize > 0, "Group size must be greater than zero");
    +
    +		this.groupSize = groupSize;
    +
    +		return this;
    +	}
    +
    +	/**
    +	 * Filter out Jaccard Index scores less than the given minimum fraction.
    +	 *
    +	 * @param numerator numerator of the minimum score
    +	 * @param denominator denominator of the minimum score
    +	 * @return this
    +	 * @see #setMaximumScore(int, int)
    +	 */
    +	public JaccardIndex<K, VV, EV> setMinimumScore(int numerator, int denominator) {
    +		Preconditions.checkArgument(numerator >= 0, "Minimum score numerator must be non-negative");
    +		Preconditions.checkArgument(denominator > 0, "Minimum score denominator must be greater than zero");
    +		Preconditions.checkArgument(numerator <= denominator, "Minimum score fraction must be less than or equal to one");
    +
    +		this.unboundedScores = false;
    +		this.minimumScoreNumerator = numerator;
    +		this.minimumScoreDenominator = denominator;
    +
    +		return this;
    +	}
    +
    +	/**
    +	 * Filter out Jaccard Index scores greater than or equal to the given maximum fraction.
    +	 *
    +	 * @param numerator numerator of the maximum score
    +	 * @param denominator denominator of the maximum score
    +	 * @return this
    +	 * @see #setMinimumScore(int, int)
    +	 */
    +	public JaccardIndex<K, VV, EV> setMaximumScore(int numerator, int denominator) {
    --- End diff --
    
    This one also and all other configuration methods.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/1980


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1980#issuecomment-220620841
  
    Thanks for the update @greghogan. I left a few comments for improvements of the docs. After these are addressed it should be good to merge.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r64049343
  
    --- Diff: docs/apis/batch/libs/gelly.md ---
    @@ -2055,22 +2055,22 @@ vertex and edge in the output graph stores the common group value and the number
     ### Jaccard Index
     
     #### Overview
    -The Jaccard Index measures the similarity between vertex neighborhoods. Scores range from 0.0 (no common neighbors) to
    -1.0 (all neighbors are common).
    +The Jaccard Index measures the similarity between vertex neighborhoods and is computed as the number of shared numbers
    +divided by the number of distinct neighbors. Scores range from 0.0 (no shared neighbors) to 1.0 (all neighbors are
    +shared).
     
     #### Details
    -Counting common neighbors for pairs of vertices is equivalent to counting the two-paths consisting of two edges
    -connecting the two vertices to the common neighbor. The number of distinct neighbors for pairs of vertices is computed
    -by storing the sum of degrees of the vertex pair and subtracting the count of common neighbors, which are double-counted
    -in the sum of degrees.
    +Counting shared neighbors for pairs of vertices is equivalent to counting connecting paths of length two. The number of
    +distinct neighbors is computed by storing the sum of degrees of the vertex pair and subtracting the count of shared
    +neighbors, which are double-counted in the sum of degrees.
     
    -The algorithm first annotates each edge with the endpoint degree. Grouping on the midpoint vertex, each pair of
    -neighbors is emitted with the endpoint degree sum. Grouping on two-paths, the common neighbors are counted.
    +The algorithm first annotates each edge with the target vertex's degree. Grouping on the source vertex, each pair of
    +neighbors is emitted with the degree sum. Grouping on two-paths, the shared neighbors are counted.
     
     #### Usage
     The algorithm takes a simple, undirected graph as input and outputs a `DataSet` of tuples containing two vertex IDs,
    -the number of common neighbors, and the number of distinct neighbors. The graph ID type must be `Comparable` and
    -`Copyable`.
    +the number of shared neighbors, and the number of distinct neighbors. The result class provides a method to compute the
    +Jaccard Index score. The graph ID type must be `Comparable` and `Copyable`.
    --- End diff --
    
    Here we should also document what is the output of the algorithm, i.e. the `Result` type and how to get the jaccard similarity out of it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63890887
  
    --- Diff: docs/apis/batch/libs/gelly.md ---
    @@ -2051,6 +2052,26 @@ The algorithm takes a directed, vertex (and possibly edge) attributed graph as i
     vertex represents a group of vertices and each edge represents a group of edges from the input graph. Furthermore, each
     vertex and edge in the output graph stores the common group value and the number of represented elements.
     
    +### Jaccard Index
    +
    +#### Overview
    +The Jaccard Index measures the similarity between vertex neighborhoods. Scores range from 0.0 (no common neighbors) to
    +1.0 (all neighbors are common).
    +
    +#### Details
    +Counting common neighbors for pairs of vertices is equivalent to counting the two-paths consisting of two edges
    +connecting the two vertices to the common neighbor. The number of distinct neighbors for pairs of vertices is computed
    +by storing the sum of degrees of the vertex pair and subtracting the count of common neighbors, which are double-counted
    +in the sum of degrees.
    +
    +The algorithm first annotates each edge with the endpoint degree. Grouping on the midpoint vertex, each pair of
    --- End diff --
    
    Source, target or both endpoints? What is a "midpoint" vertex? Is this standard terminology? It might help to give a small example here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63895157
  
    --- Diff: flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/similarity/JaccardIndex.java ---
    @@ -0,0 +1,462 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library.similarity;
    +
    +import org.apache.flink.api.common.ExecutionConfig;
    +import org.apache.flink.api.common.functions.FlatMapFunction;
    +import org.apache.flink.api.common.functions.GroupReduceFunction;
    +import org.apache.flink.api.common.operators.Order;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.functions.FunctionAnnotation;
    +import org.apache.flink.api.java.tuple.Tuple2;
    +import org.apache.flink.api.java.tuple.Tuple3;
    +import org.apache.flink.api.java.tuple.Tuple4;
    +import org.apache.flink.graph.Edge;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.GraphAlgorithm;
    +import org.apache.flink.graph.asm.degree.annotate.undirected.EdgeTargetDegree;
    +import org.apache.flink.graph.library.similarity.JaccardIndex.Result;
    +import org.apache.flink.graph.utils.Murmur3_32;
    +import org.apache.flink.types.CopyableValue;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.util.Collector;
    +import org.apache.flink.util.Preconditions;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +/**
    + * The Jaccard Index measures the similarity between vertex neighborhoods.
    + * Scores range from 0.0 (no common neighbors) to 1.0 (all neighbors are common).
    + * <br/>
    + * This implementation produces similarity scores for each pair of vertices
    + * in the graph with at least one common neighbor; equivalently, this is the
    + * set of all non-zero Jaccard Similarity coefficients.
    + * <br/>
    + * The input graph must be a simple, undirected graph containing no duplicate
    + * edges or self-loops.
    + *
    + * @param <K> graph ID type
    + * @param <VV> vertex value type
    + * @param <EV> edge value type
    + */
    +public class JaccardIndex<K extends CopyableValue<K>, VV, EV>
    +implements GraphAlgorithm<K, VV, EV, DataSet<Result<K>>> {
    +
    +	public static final int DEFAULT_GROUP_SIZE = 64;
    +
    +	// Optional configuration
    +	private int groupSize = DEFAULT_GROUP_SIZE;
    +
    +	private boolean unboundedScores = true;
    +
    +	private int minimumScoreNumerator = 0;
    +
    +	private int minimumScoreDenominator = 1;
    +
    +	private int maximumScoreNumerator = 1;
    +
    +	private int maximumScoreDenominator = 0;
    +
    +	private int littleParallelism = ExecutionConfig.PARALLELISM_UNKNOWN;
    +
    +	/**
    +	 * Override the default group size for the quadratic expansion of neighbor
    +	 * pairs. Small groups generate more data whereas large groups distribute
    +	 * computation less evenly among tasks.
    +	 *
    +	 * @param groupSize the group size for the quadratic expansion of neighbor pairs
    +	 * @return this
    +	 */
    +	public JaccardIndex<K, VV, EV> setGroupSize(int groupSize) {
    +		Preconditions.checkArgument(groupSize > 0, "Group size must be greater than zero");
    +
    +		this.groupSize = groupSize;
    +
    +		return this;
    +	}
    +
    +	/**
    +	 * Filter out Jaccard Index scores less than the given minimum fraction.
    +	 *
    +	 * @param numerator numerator of the minimum score
    +	 * @param denominator denominator of the minimum score
    +	 * @return this
    +	 * @see #setMaximumScore(int, int)
    +	 */
    +	public JaccardIndex<K, VV, EV> setMinimumScore(int numerator, int denominator) {
    +		Preconditions.checkArgument(numerator >= 0, "Minimum score numerator must be non-negative");
    +		Preconditions.checkArgument(denominator > 0, "Minimum score denominator must be greater than zero");
    +		Preconditions.checkArgument(numerator <= denominator, "Minimum score fraction must be less than or equal to one");
    +
    +		this.unboundedScores = false;
    +		this.minimumScoreNumerator = numerator;
    +		this.minimumScoreDenominator = denominator;
    +
    +		return this;
    +	}
    +
    +	/**
    +	 * Filter out Jaccard Index scores greater than or equal to the given maximum fraction.
    +	 *
    +	 * @param numerator numerator of the maximum score
    +	 * @param denominator denominator of the maximum score
    +	 * @return this
    +	 * @see #setMinimumScore(int, int)
    +	 */
    +	public JaccardIndex<K, VV, EV> setMaximumScore(int numerator, int denominator) {
    +		Preconditions.checkArgument(numerator >= 0, "Maximum score numerator must be non-negative");
    +		Preconditions.checkArgument(denominator > 0, "Maximum score denominator must be greater than zero");
    +		Preconditions.checkArgument(numerator <= denominator, "Maximum score fraction must be less than or equal to one");
    +
    +		this.unboundedScores = false;
    +		this.maximumScoreNumerator = numerator;
    +		this.maximumScoreDenominator = denominator;
    +
    +		return this;
    +	}
    +
    +	/**
    +	 * Override the parallelism of operators processing small amounts of data.
    +	 *
    +	 * @param littleParallelism operator parallelism
    +	 * @return this
    +	 */
    +	public JaccardIndex<K, VV, EV> setLittleParallelism(int littleParallelism) {
    +		this.littleParallelism = littleParallelism;
    +
    +		return this;
    +	}
    +
    +	/*
    +	 * Implementation notes:
    +	 *
    +	 * The requirement that "K extends CopyableValue<K>" can be removed when
    +	 *   Flink has a self-join which performs the skew distribution handled by
    +	 *   GenerateGroupSpans / GenerateGroups / GenerateGroupPairs.
    +	 */
    +
    +	@Override
    +	public DataSet<Result<K>> run(Graph<K, VV, EV> input)
    +			throws Exception {
    +		// s, t, d(t)
    +		DataSet<Edge<K, Tuple2<EV, LongValue>>> neighborDegree = input
    +			.run(new EdgeTargetDegree<K, VV, EV>()
    +				.setParallelism(littleParallelism));
    +
    +		// group span, s, t, d(t)
    +		DataSet<Tuple4<IntValue, K, K, IntValue>> groupSpans = neighborDegree
    +			.groupBy(0)
    +			.sortGroup(1, Order.ASCENDING)
    +			.reduceGroup(new GenerateGroupSpans<K, EV>(groupSize))
    +				.setParallelism(littleParallelism)
    +				.name("Generate group spans");
    +
    +		// group, s, t, d(t)
    +		DataSet<Tuple4<IntValue, K, K, IntValue>> groups = groupSpans
    +			.rebalance()
    +				.setParallelism(littleParallelism)
    +				.name("Rebalance")
    +			.flatMap(new GenerateGroups<K>())
    +				.setParallelism(littleParallelism)
    +				.name("Generate groups");
    +
    +		// t, u, d(t)+d(u)
    +		DataSet<Tuple3<K, K, IntValue>> twoPaths = groups
    +			.groupBy(0, 1)
    +			.sortGroup(2, Order.ASCENDING)
    +			.reduceGroup(new GenerateGroupPairs<K>(groupSize))
    +				.name("Generate group pairs");
    +
    +		// t, u, intersection, union
    +		return twoPaths
    +			.groupBy(0, 1)
    +			.reduceGroup(new ComputeScores<K>(unboundedScores,
    +					minimumScoreNumerator, minimumScoreDenominator,
    +					maximumScoreNumerator, maximumScoreDenominator))
    +				.name("Compute scores");
    +	}
    +
    +	/**
    +	 * This is the first of three operations implementing a self-join to generate
    +	 * the full neighbor pairing for each vertex. The number of neighbor pairs
    +	 * is (n choose 2) which is quadratic in the vertex degree.
    +	 * <br/>
    +	 * The third operation, {@link GenerateGroupPairs}, processes groups of size
    +	 * {@link #groupSize} and emits {@code O(groupSize * deg(vertex))} pairs.
    +	 * <br/>
    +	 * This input to the third operation is still quadratic in the vertex degree.
    +	 * Two prior operations, {@link GenerateGroupSpans} and {@link GenerateGroups},
    +	 * each emit datasets linear in the vertex degree, with a forced rebalance
    +	 * in between. {@link GenerateGroupSpans} first annotates each edge with the
    +	 * number of groups and {@link GenerateGroups} emits each edge into each group.
    --- End diff --
    
    \U0001f44d for the detailed comment!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r64048955
  
    --- Diff: flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/similarity/JaccardIndex.java ---
    @@ -43,11 +43,13 @@
     import java.util.List;
     
     /**
    - * The Jaccard Index measures the similarity between vertex neighborhoods.
    - * Scores range from 0.0 (no common neighbors) to 1.0 (all neighbors are common).
    + * The Jaccard Index measures the similarity between vertex neighborhoods and
    + * is computed as the number of shared numbers divided by the number of
    --- End diff --
    
    numbers -> neighbors


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63896459
  
    --- Diff: flink-libraries/flink-gelly/src/test/java/org/apache/flink/graph/library/similarity/JaccardIndexTest.java ---
    @@ -0,0 +1,136 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library.similarity;
    +
    +import org.apache.commons.math3.random.JDKRandomGenerator;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.Utils.ChecksumHashCode;
    +import org.apache.flink.api.java.utils.DataSetUtils;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.asm.AsmTestBase;
    +import org.apache.flink.graph.generator.RMatGraph;
    +import org.apache.flink.graph.generator.random.JDKRandomGeneratorFactory;
    +import org.apache.flink.graph.generator.random.RandomGenerableFactory;
    +import org.apache.flink.graph.library.similarity.JaccardIndex.Result;
    +import org.apache.flink.test.util.TestBaseUtils;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.types.NullValue;
    +import org.junit.Test;
    +
    +import static org.junit.Assert.assertEquals;
    +
    +public class JaccardIndexTest
    +extends AsmTestBase {
    +
    +	@Test
    +	public void testSimpleGraph()
    +			throws Exception {
    +		DataSet<Result<IntValue>> ji = undirectedSimpleGraph
    +			.run(new JaccardIndex<IntValue, NullValue, NullValue>());
    +
    +		String expectedResult =
    +			"(0,1,(1,4))\n" +
    +			"(0,2,(1,4))\n" +
    +			"(0,3,(2,4))\n" +
    +			"(1,2,(2,4))\n" +
    +			"(1,3,(1,6))\n" +
    +			"(1,4,(1,3))\n" +
    +			"(1,5,(1,3))\n" +
    +			"(2,3,(1,6))\n" +
    +			"(2,4,(1,3))\n" +
    +			"(2,5,(1,3))\n" +
    +			"(4,5,(1,1))\n";
    +
    +		TestBaseUtils.compareResultAsText(ji.collect(), expectedResult);
    +	}
    +
    +	@Test
    +	public void testSimpleGraphWithMinimumScore()
    +			throws Exception {
    +		DataSet<Result<IntValue>> ji = undirectedSimpleGraph
    +			.run(new JaccardIndex<IntValue, NullValue, NullValue>()
    +				.setMinimumScore(1, 2));
    +
    +		String expectedResult =
    +			"(0,3,(2,4))\n" +
    +			"(1,2,(2,4))\n" +
    +			"(4,5,(1,1))\n";
    +
    +		TestBaseUtils.compareResultAsText(ji.collect(), expectedResult);
    +	}
    +
    +	@Test
    +	public void testSimpleGraphWithMaximumScore()
    +			throws Exception {
    +		DataSet<Result<IntValue>> ji = undirectedSimpleGraph
    +			.run(new JaccardIndex<IntValue, NullValue, NullValue>()
    +				.setMaximumScore(1, 2));
    +
    +		String expectedResult =
    +			"(0,1,(1,4))\n" +
    +			"(0,2,(1,4))\n" +
    +			"(1,3,(1,6))\n" +
    +			"(1,4,(1,3))\n" +
    +			"(1,5,(1,3))\n" +
    +			"(2,3,(1,6))\n" +
    +			"(2,4,(1,3))\n" +
    +			"(2,5,(1,3))\n";
    +
    +		TestBaseUtils.compareResultAsText(ji.collect(), expectedResult);
    +	}
    +
    +	@Test
    +	public void testCompleteGraph()
    +			throws Exception {
    +		DataSet<Result<LongValue>> ji = completeGraph
    +			.run(new JaccardIndex<LongValue, NullValue, NullValue>()
    +				.setGroupSize(4));
    +
    +		for (Result<LongValue> result : ji.collect()) {
    +			// the intersection includes every vertex
    +			assertEquals(completeGraphVertexCount, result.getDistinctNeighborCount().getValue());
    +
    +			// the union only excludes the two vertices from the similarity score
    +			assertEquals(completeGraphVertexCount - 2, result.getCommonNeighborCount().getValue());
    +		}
    +	}
    +
    +	@Test
    +	public void testRMatGraph()
    +			throws Exception {
    +		long vertexCount = 1 << 8;
    +		long edgeCount = 8 * vertexCount;
    +
    +		RandomGenerableFactory<JDKRandomGenerator> rnd = new JDKRandomGeneratorFactory();
    +
    +		Graph<LongValue, NullValue, NullValue> graph = new RMatGraph<>(env, rnd, vertexCount, edgeCount)
    +			.setSimpleGraph(true, false)
    +			.generate();
    +
    +		DataSet<Result<LongValue>> ji = graph
    +			.run(new JaccardIndex<LongValue, NullValue, NullValue>()
    +				.setGroupSize(4));
    +
    +		ChecksumHashCode checksum = DataSetUtils.checksumHashCode(ji);
    +
    +		assertEquals(13954, checksum.getCount());
    +		assertEquals(0x0000179f83a2a873L, checksum.getChecksum());
    --- End diff --
    
    How did you compute these values?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by vasia <gi...@git.apache.org>.
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1980#issuecomment-220103434
  
    Hey @greghogan,
    it's usually a good idea to have at least one person look at the code :)
    Is this blocking some other feature? If not, I could take a look at this PR tomorrow if you like.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3780] [gelly] Jaccard Similarity

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63956013
  
    --- Diff: flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/similarity/JaccardIndex.java ---
    @@ -0,0 +1,462 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.graph.library.similarity;
    +
    +import org.apache.flink.api.common.ExecutionConfig;
    +import org.apache.flink.api.common.functions.FlatMapFunction;
    +import org.apache.flink.api.common.functions.GroupReduceFunction;
    +import org.apache.flink.api.common.operators.Order;
    +import org.apache.flink.api.java.DataSet;
    +import org.apache.flink.api.java.functions.FunctionAnnotation;
    +import org.apache.flink.api.java.tuple.Tuple2;
    +import org.apache.flink.api.java.tuple.Tuple3;
    +import org.apache.flink.api.java.tuple.Tuple4;
    +import org.apache.flink.graph.Edge;
    +import org.apache.flink.graph.Graph;
    +import org.apache.flink.graph.GraphAlgorithm;
    +import org.apache.flink.graph.asm.degree.annotate.undirected.EdgeTargetDegree;
    +import org.apache.flink.graph.library.similarity.JaccardIndex.Result;
    +import org.apache.flink.graph.utils.Murmur3_32;
    +import org.apache.flink.types.CopyableValue;
    +import org.apache.flink.types.IntValue;
    +import org.apache.flink.types.LongValue;
    +import org.apache.flink.util.Collector;
    +import org.apache.flink.util.Preconditions;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +/**
    + * The Jaccard Index measures the similarity between vertex neighborhoods.
    + * Scores range from 0.0 (no common neighbors) to 1.0 (all neighbors are common).
    + * <br/>
    + * This implementation produces similarity scores for each pair of vertices
    + * in the graph with at least one common neighbor; equivalently, this is the
    + * set of all non-zero Jaccard Similarity coefficients.
    + * <br/>
    + * The input graph must be a simple, undirected graph containing no duplicate
    + * edges or self-loops.
    + *
    + * @param <K> graph ID type
    + * @param <VV> vertex value type
    + * @param <EV> edge value type
    + */
    +public class JaccardIndex<K extends CopyableValue<K>, VV, EV>
    +implements GraphAlgorithm<K, VV, EV, DataSet<Result<K>>> {
    +
    +	public static final int DEFAULT_GROUP_SIZE = 64;
    +
    +	// Optional configuration
    +	private int groupSize = DEFAULT_GROUP_SIZE;
    +
    +	private boolean unboundedScores = true;
    +
    +	private int minimumScoreNumerator = 0;
    +
    +	private int minimumScoreDenominator = 1;
    +
    +	private int maximumScoreNumerator = 1;
    +
    +	private int maximumScoreDenominator = 0;
    +
    +	private int littleParallelism = ExecutionConfig.PARALLELISM_UNKNOWN;
    +
    +	/**
    +	 * Override the default group size for the quadratic expansion of neighbor
    +	 * pairs. Small groups generate more data whereas large groups distribute
    +	 * computation less evenly among tasks.
    +	 *
    +	 * @param groupSize the group size for the quadratic expansion of neighbor pairs
    +	 * @return this
    +	 */
    +	public JaccardIndex<K, VV, EV> setGroupSize(int groupSize) {
    --- End diff --
    
    I don't know that 64 is the best value but I think it is close. I clarified the javadoc to suggest that users not change this value.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---