You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/09/07 17:24:34 UTC

[GitHub] [lucene] mayya-sharipova opened a new pull request #287: Optimize storage of neighbours on level > 0

mayya-sharipova opened a new pull request #287:
URL: https://github.com/apache/lucene/pull/287


   Currently we are storing neighbourhoods of all nodes for all levels,
   even though level > 0 contain fewer nodes.
   
   This patch fixes this by:
   - graph stores for a level a list of neighbourhoods only for nodes
     present on this level. Neighbours on all levels are presented
     as level 0's ordinals.
   - nodesByLevel defines which nodes are stored on this level, presented
     as level 0's nodes' ordinals.
   - to get neighbours for a given node for level > 0, we first do a binary
     search on nodesByLevel for this level to find node's index in the
     graph's lists of neighbourhoods for this level. We then use this
     index to retrieve a desired neighbourhood.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a change in pull request #287: Optimize storage of neighbours on level > 0

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #287:
URL: https://github.com/apache/lucene/pull/287#discussion_r705421087



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,21 +58,27 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int curMaxLevel; // the current max graph level
+  private int entryNode; // the current graph entry node on the top level
+
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
   // graph is a list of graph levels.
   // Each level is represented as List<NeighborArray> – nodes' connections on this level.
   // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
   // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
   private final List<List<NeighborArray>> graph;
-  private int curMaxLevel; // the current max graph level
-  private int entryNode; // the current graph entry node on the top level
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
   // used for iterating over graph values
   private int curLevel = -1;
-  private int curNode = -1;
+  private int curNodeOrd = -1;

Review comment:
       @jtibshirani Great comment, indeed it was confusing. In 2c02db0c5dc0b290f7eb4d1b9c9e3d8d2691f364:
   -  In `KnnGraphValues` I removed `seekLevel(level)`, and instead adopted the previous `seek(level, target)` method.
   - In `HnswGraph`  I removed `curNodeOrd` and `curLevel` fields, we don't need them anymore. Added `getAllNodesOnLevel(level)` that allows to iterate over nodes on the given level. This is needed both for tests, and also later in `Lucene90HnswVectorsWriter`  when storing a graph on disk.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova merged pull request #287: Optimize storage of neighbours on level > 0

Posted by GitBox <gi...@apache.org>.
mayya-sharipova merged pull request #287:
URL: https://github.com/apache/lucene/pull/287


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a change in pull request #287: Optimize storage of neighbours on level > 0

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on a change in pull request #287:
URL: https://github.com/apache/lucene/pull/287#discussion_r705530357



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -334,4 +324,43 @@ public int maxLevel() {
   public int entryNode() {
     return entryNode;
   }
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node
+   */
+  public DocIdSetIterator getAllNodesOnLevel(int level) {

Review comment:
       Is `DocIdSetIterator` the right choice? I've only seen us use this when the output is really doc IDs. From searching through the repo, I didn't see a common alternative though.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a change in pull request #287: Optimize storage of neighbours on level > 0

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #287:
URL: https://github.com/apache/lucene/pull/287#discussion_r705570925



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -334,4 +324,43 @@ public int maxLevel() {
   public int entryNode() {
     return entryNode;
   }
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node
+   */
+  public DocIdSetIterator getAllNodesOnLevel(int level) {

Review comment:
       @jtibshirani I was not happy with `DocIdSetIterator` as well.  We can use Java's `Iterator<Integer>`, but this would require boxing/unboxing between primitives.  Or create another class? 
   
   I guess may be for now keep `DocIdSetIterator`, and later can think for something better (I can add TODO for it). WDYT? 
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a change in pull request #287: Optimize storage of neighbours on level > 0

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #287:
URL: https://github.com/apache/lucene/pull/287#discussion_r705631815



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -334,4 +324,43 @@ public int maxLevel() {
   public int entryNode() {
     return entryNode;
   }
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node
+   */
+  public DocIdSetIterator getAllNodesOnLevel(int level) {

Review comment:
       Addressed in c7a696e83d26a66598ff7ea2ec2107c0eae47bed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a change in pull request #287: Optimize storage of neighbours on level > 0

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #287:
URL: https://github.com/apache/lucene/pull/287#discussion_r705570925



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -334,4 +324,43 @@ public int maxLevel() {
   public int entryNode() {
     return entryNode;
   }
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node
+   */
+  public DocIdSetIterator getAllNodesOnLevel(int level) {

Review comment:
       @jtibshirani I was not happy with `DocIdSetIterator` as well.  We can use Java's `Iterator<Integer>`, but this would require boxing/unboxing between primitives and Integer objects.  Or create another class? 
   
   I guess may be for now keep `DocIdSetIterator`, and later can think for something better (I can add TODO for it). WDYT? 
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a change in pull request #287: Optimize storage of neighbours on level > 0

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #287:
URL: https://github.com/apache/lucene/pull/287#discussion_r705416239



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -230,36 +244,50 @@ static NeighborQueue searchLevel(
    * Returns the {@link NeighborQueue} connected to the given node.
    *
    * @param level level of the graph
-   * @param node the node whose neighbors are returned
+   * @param node the node whose neighbors are returned, represented as an ordinal on the level 0.
    */
   public NeighborArray getNeighbors(int level, int node) {
-    NeighborArray result = graph.get(level).get(node);
-    assert result != null;
-    return result;
+    if (level == 0) {
+      return graph.get(level).get(node);
+    }
+    int nodeOrd = Arrays.binarySearch(nodesByLevel.get(level), 0, graph.get(level).size(), node);

Review comment:
       Addressed in 2c02db0c5dc0b290f7eb4d1b9c9e3d8d2691f364




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a change in pull request #287: Optimize storage of neighbours on level > 0

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on a change in pull request #287:
URL: https://github.com/apache/lucene/pull/287#discussion_r703752457



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -230,36 +244,50 @@ static NeighborQueue searchLevel(
    * Returns the {@link NeighborQueue} connected to the given node.
    *
    * @param level level of the graph
-   * @param node the node whose neighbors are returned
+   * @param node the node whose neighbors are returned, represented as an ordinal on the level 0.
    */
   public NeighborArray getNeighbors(int level, int node) {
-    NeighborArray result = graph.get(level).get(node);
-    assert result != null;
-    return result;
+    if (level == 0) {
+      return graph.get(level).get(node);
+    }
+    int nodeOrd = Arrays.binarySearch(nodesByLevel.get(level), 0, graph.get(level).size(), node);

Review comment:
       Would a clearer name for this be `nodeIndex`? I think of `nodeOrd` as representing the ordinal on level 0.

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,21 +58,27 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int curMaxLevel; // the current max graph level
+  private int entryNode; // the current graph entry node on the top level
+
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
   // graph is a list of graph levels.
   // Each level is represented as List<NeighborArray> – nodes' connections on this level.
   // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
   // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
   private final List<List<NeighborArray>> graph;
-  private int curMaxLevel; // the current max graph level
-  private int entryNode; // the current graph entry node on the top level
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
   // used for iterating over graph values
   private int curLevel = -1;
-  private int curNode = -1;
+  private int curNodeOrd = -1;

Review comment:
       This is not directly related to this PR, but I am finding the seeking behavior hard to follow. We have two pieces of state that seem similar: `cur` for the current node's NeighborArray, and `curNodeOrd` which is a node index within a level. These can get out of sync depending on what methods are called.
   
   I wonder if we could remove `curNodeOrd` -- I think it is only used to support `nextNodeOnLevel()`, which in turn is only used in tests. Maybe we could replace it with a package-private method for testing like `getAllNodesOnLevel(level)?

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,21 +58,27 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int curMaxLevel; // the current max graph level
+  private int entryNode; // the current graph entry node on the top level
+
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
   // graph is a list of graph levels.
   // Each level is represented as List<NeighborArray> – nodes' connections on this level.
   // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
   // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
   private final List<List<NeighborArray>> graph;
-  private int curMaxLevel; // the current max graph level
-  private int entryNode; // the current graph entry node on the top level
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
   // used for iterating over graph values
   private int curLevel = -1;
-  private int curNode = -1;
+  private int curNodeOrd = -1;

Review comment:
       This is not directly related to this PR, but I am finding the seeking behavior hard to follow. We have two pieces of state that seem similar: `cur` for the current node's NeighborArray, and `curNodeOrd` which is a node index within a level. These can get out of sync depending on what methods are called.
   
   I wonder if we could remove `curNodeOrd` -- I think it is only used to support `nextNodeOnLevel()`, which in turn is only used in tests. Maybe we could replace it with a package-private method for testing like `getAllNodesOnLevel(level)`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a change in pull request #287: Optimize storage of neighbours on level > 0

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on a change in pull request #287:
URL: https://github.com/apache/lucene/pull/287#discussion_r705601152



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -334,4 +324,43 @@ public int maxLevel() {
   public int entryNode() {
     return entryNode;
   }
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node
+   */
+  public DocIdSetIterator getAllNodesOnLevel(int level) {

Review comment:
       Adding a TODO sounds good to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org