You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/05/02 16:18:03 UTC

[GitHub] [lucene] mayya-sharipova opened a new pull request, #862: LUCENE-9848 Sort HNSW graph neighbors for construction

mayya-sharipova opened a new pull request, #862:
URL: https://github.com/apache/lucene/pull/862

   Sort HNSW graph neighbors when applying diversity criterion
   
   During HNSW graph construction, when a node has already a number of
   connections larger than maximum allowed (maxConn), we need to prune
   its connections using a diversity criteria to limit the number of
   connections to maxConn.
   
   Currently when we add reverse connections to already existing nodes,
   we don't keep them sorted. Thus later, when we apply diversity criteria
   we may prune not the worst most distant non-diverse nodes.
   
   This patch makes sure that neighbours connections are always sorted
   from best (closest) to worst (distant), and during the application
   of diversity criteria processes nodes from worst to best.
   
   This path does the following:
   - enhance NeighborArray to always keep neighbour nodes sorted according
     to their scores (in desc or asc order). Make NeighborArray aware in
     which order the nodes should be sorted.
   - make OnHeapHnswGraph aware of the order of similarity function
   - make HnswGraphBuilder apply diversity criteria from worst to
     best nodes
   - create Lucene90NeighborArray to keep the previous logic of
     NeighborArray for Lucene90Codec


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #862:
URL: https://github.com/apache/lucene/pull/862#discussion_r864407369


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -72,8 +104,38 @@ public void removeLast() {
     size--;
   }
 
+  public void removeIndex(int idx) {
+    for (int i = idx; i < (size - 1); i++) {
+      node[i] = node[i + 1];
+      score[i] = score[i + 1];
+    }
+    size--;
+  }
+
   @Override
   public String toString() {
     return "NeighborArray[" + size + "]";
   }
+
+  private int ascSortFindRightMostInsertionPoint(float newScore) {
+    int start = 0;

Review Comment:
   Great comment, addressed in: 160904904c94ffd4d194fc2509124c0e2eb9c44a



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -72,8 +104,38 @@ public void removeLast() {
     size--;
   }
 
+  public void removeIndex(int idx) {
+    for (int i = idx; i < (size - 1); i++) {
+      node[i] = node[i + 1];

Review Comment:
   Great comment, addressed in: 160904904c94ffd4d194fc2509124c0e2eb9c44a



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #862:
URL: https://github.com/apache/lucene/pull/862#issuecomment-1116854352

   @msokolov Thanks so much for your feedback. I've addressed it in 160904904c94ffd4d194fc2509124c0e2eb9c44a.
   
   I've also did another set of benchmarking with new changes, this time with higher `M` and `efConstruction` values:
   
   glove-100-angular M:32 efConstruction:500
   
   Happy to report that indexing took just slightly more (3%), while for searches reported slight gains in both recall and QPS:
   
   **Indexing Speed**
   baseline: Indexed 1183514 documents in 3697s
   candidate Indexed 1183514 documents in 3814s
   
   
   **Search**
   
   |             | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ----------- | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10  |           0.634 |     2628.879 |            0.659 |      2659.589 |
   | n_cands=20  |           0.696 |     2034.672 |            0.719 |      2186.015 |
   | n_cands=40  |           0.762 |     1611.403 |            0.784 |      1666.735 |
   | n_cands=80  |           0.824 |     1150.786 |            0.843 |      1161.111 |
   | n_cands=120 |           0.855 |      878.114 |            0.874 |       890.783 |
   | n_cands=200 |           0.892 |      610.891 |            0.907 |       621.173 |
   | n_cands=400 |           0.931 |      343.649 |            0.944 |       359.924 |
   | n_cands=600 |           0.949 |      239.217 |            0.960 |       256.863 |
   | n_cands=800 |           0.961 |      189.523 |            0.970 |       199.867 |
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a diff in pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction

Posted by GitBox <gi...@apache.org>.
msokolov commented on code in PR #862:
URL: https://github.com/apache/lucene/pull/862#discussion_r863041170


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -21,32 +21,64 @@
 
 /**
  * NeighborArray encodes the neighbors of a node and their mutual scores in the HNSW graph as a pair
- * of growable arrays.
+ * of growable arrays. Nodes are arranged in the sorted order of their scores in descending order
+ * (if scoresDescOrder is true), or in the ascending order of their scores (if scoresDescOrder is
+ * false)
  *
  * @lucene.internal
  */
 public class NeighborArray {
-
+  private final boolean scoresDescOrder;
   private int size;
 
   float[] score;
   int[] node;
 
-  public NeighborArray(int maxSize) {
+  public NeighborArray(int maxSize, boolean descOrder) {
     node = new int[maxSize];
     score = new float[maxSize];
+    this.scoresDescOrder = descOrder;
   }
 
+  /**
+   * Add a new node to the NeighborArray. The new node must be worse than all previously stored
+   * nodes.
+   */
   public void add(int newNode, float newScore) {
     if (size == node.length - 1) {
       node = ArrayUtil.grow(node, (size + 1) * 3 / 2);
       score = ArrayUtil.growExact(score, node.length);
     }
+    if (size > 0) {
+      float previousScore = score[size - 1];
+      assert ((scoresDescOrder && (previousScore >= newScore))
+              || (scoresDescOrder == false && (previousScore <= newScore)))
+          : "Nodes are added in the incorrect order!";
+    }
     node[size] = newNode;
     score[size] = newScore;
     ++size;
   }
 
+  /** Add a new node to the NeighborArray into a correct sort position according to its score. */
+  public void addAndSort(int newNode, float newScore) {
+    if (size == node.length - 1) {
+      node = ArrayUtil.grow(node, (size + 1) * 3 / 2);
+      score = ArrayUtil.growExact(score, node.length);
+    }
+    int insertionPoint =
+        scoresDescOrder
+            ? descSortFindRightMostInsertionPoint(newScore)
+            : ascSortFindRightMostInsertionPoint(newScore);
+    for (int i = size; i > insertionPoint; i--) {

Review Comment:
   Can we we use `System.arraycopy`? Does it work in reverse order when it needs to?



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -72,8 +104,38 @@ public void removeLast() {
     size--;
   }
 
+  public void removeIndex(int idx) {
+    for (int i = idx; i < (size - 1); i++) {
+      node[i] = node[i + 1];

Review Comment:
   `System.arraycopy` should definitely be safe here



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -72,8 +104,38 @@ public void removeLast() {
     size--;
   }
 
+  public void removeIndex(int idx) {
+    for (int i = idx; i < (size - 1); i++) {
+      node[i] = node[i + 1];
+      score[i] = score[i + 1];
+    }
+    size--;
+  }
+
   @Override
   public String toString() {
     return "NeighborArray[" + size + "]";
   }
+
+  private int ascSortFindRightMostInsertionPoint(float newScore) {
+    int start = 0;

Review Comment:
   I think we can use `Arrays.binarySearch`, although sadly not for the inverse sort case



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -21,32 +21,64 @@
 
 /**
  * NeighborArray encodes the neighbors of a node and their mutual scores in the HNSW graph as a pair
- * of growable arrays.
+ * of growable arrays. Nodes are arranged in the sorted order of their scores in descending order
+ * (if scoresDescOrder is true), or in the ascending order of their scores (if scoresDescOrder is
+ * false)
  *
  * @lucene.internal
  */
 public class NeighborArray {
-
+  private final boolean scoresDescOrder;
   private int size;
 
   float[] score;
   int[] node;
 
-  public NeighborArray(int maxSize) {
+  public NeighborArray(int maxSize, boolean descOrder) {
     node = new int[maxSize];
     score = new float[maxSize];
+    this.scoresDescOrder = descOrder;
   }
 
+  /**
+   * Add a new node to the NeighborArray. The new node must be worse than all previously stored
+   * nodes.
+   */
   public void add(int newNode, float newScore) {
     if (size == node.length - 1) {
       node = ArrayUtil.grow(node, (size + 1) * 3 / 2);
       score = ArrayUtil.growExact(score, node.length);
     }
+    if (size > 0) {
+      float previousScore = score[size - 1];
+      assert ((scoresDescOrder && (previousScore >= newScore))
+              || (scoresDescOrder == false && (previousScore <= newScore)))
+          : "Nodes are added in the incorrect order!";
+    }
     node[size] = newNode;
     score[size] = newScore;
     ++size;
   }
 
+  /** Add a new node to the NeighborArray into a correct sort position according to its score. */
+  public void addAndSort(int newNode, float newScore) {

Review Comment:
   I might rename this `insertSorted` since it is not really re-sorting?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #862:
URL: https://github.com/apache/lucene/pull/862#discussion_r865134562


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -72,8 +103,39 @@ public void removeLast() {
     size--;
   }
 
+  public void removeIndex(int idx) {
+    System.arraycopy(node, idx + 1, node, idx, size - idx);
+    System.arraycopy(score, idx + 1, score, idx, size - idx);
+    size--;
+  }
+
   @Override
   public String toString() {
     return "NeighborArray[" + size + "]";
   }
+
+  private int ascSortFindRightMostInsertionPoint(float newScore) {
+    int insertionPoint = Arrays.binarySearch(score, 0, size, newScore);
+    if (insertionPoint >= 0) {
+      // find the right most position with the same score

Review Comment:
   Thanks for a suggestion, it is indeed a good suggestion, I have not thought about this.  But considering it is relatively rare to have duplicates in vectors, I opt for keeping the current code. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #862:
URL: https://github.com/apache/lucene/pull/862#issuecomment-1115093505

   I've run benchmarks on 2 datasets:
   baseline  – main branch
   candidate – this PR:
   
   Indexing time was the same or slightly better.
   Recall slightly increased: 1-3%
   QPS slightly decreased
   
   ---
   **glove-100-angular M:16 efConstruction:100**
   
   
   |             | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ----------- | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10  |           0.463 |     3742.506 |            0.476 |      3743.109 |
   | n_cands=20  |           0.529 |     3191.042 |            0.541 |      2954.978 |
   | n_cands=40  |           0.603 |     2576.537 |            0.617 |      2681.529 |
   | n_cands=80  |           0.682 |     1877.982 |            0.696 |      1629.359 |
   | n_cands=120 |           0.721 |     1492.808 |            0.737 |      1447.522 |
   | n_cands=200 |           0.766 |     1099.154 |            0.784 |      1093.390 |
   | n_cands=400 |           0.818 |      672.911 |            0.834 |       667.136 |
   | n_cands=600 |           0.845 |      489.292 |            0.859 |       478.109 |
   | n_cands=800 |           0.862 |      383.402 |            0.875 |       383.488 |
   
   
   ---
   **sift-128-euclidean  M:16 efConstruction:500**
   
   |             | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ----------- | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10  |           0.747 |     3891.531 |            0.753 |      3304.833 |
   | n_cands=20  |           0.817 |     3359.364 |            0.819 |      2879.256 |
   | n_cands=40  |           0.889 |     2590.605 |            0.887 |      2280.737 |
   | n_cands=80  |           0.944 |     1798.558 |            0.941 |      1648.794 |
   | n_cands=120 |           0.964 |     1383.721 |            0.963 |      1377.711 |
   | n_cands=200 |           0.983 |      973.862 |            0.981 |       917.995 |
   | n_cands=400 |           0.994 |      586.816 |            0.993 |       548.358 |
   | n_cands=600 |           0.997 |      427.128 |            0.996 |       408.959 |
   | n_cands=800 |           0.998 |      341.178 |            0.997 |       316.840 |
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova merged pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction

Posted by GitBox <gi...@apache.org>.
mayya-sharipova merged PR #862:
URL: https://github.com/apache/lucene/pull/862


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #862:
URL: https://github.com/apache/lucene/pull/862#discussion_r864407239


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -21,32 +21,64 @@
 
 /**
  * NeighborArray encodes the neighbors of a node and their mutual scores in the HNSW graph as a pair
- * of growable arrays.
+ * of growable arrays. Nodes are arranged in the sorted order of their scores in descending order
+ * (if scoresDescOrder is true), or in the ascending order of their scores (if scoresDescOrder is
+ * false)
  *
  * @lucene.internal
  */
 public class NeighborArray {
-
+  private final boolean scoresDescOrder;
   private int size;
 
   float[] score;
   int[] node;
 
-  public NeighborArray(int maxSize) {
+  public NeighborArray(int maxSize, boolean descOrder) {
     node = new int[maxSize];
     score = new float[maxSize];
+    this.scoresDescOrder = descOrder;
   }
 
+  /**
+   * Add a new node to the NeighborArray. The new node must be worse than all previously stored
+   * nodes.
+   */
   public void add(int newNode, float newScore) {
     if (size == node.length - 1) {
       node = ArrayUtil.grow(node, (size + 1) * 3 / 2);
       score = ArrayUtil.growExact(score, node.length);
     }
+    if (size > 0) {
+      float previousScore = score[size - 1];
+      assert ((scoresDescOrder && (previousScore >= newScore))
+              || (scoresDescOrder == false && (previousScore <= newScore)))
+          : "Nodes are added in the incorrect order!";
+    }
     node[size] = newNode;
     score[size] = newScore;
     ++size;
   }
 
+  /** Add a new node to the NeighborArray into a correct sort position according to its score. */
+  public void addAndSort(int newNode, float newScore) {
+    if (size == node.length - 1) {
+      node = ArrayUtil.grow(node, (size + 1) * 3 / 2);
+      score = ArrayUtil.growExact(score, node.length);
+    }
+    int insertionPoint =
+        scoresDescOrder
+            ? descSortFindRightMostInsertionPoint(newScore)
+            : ascSortFindRightMostInsertionPoint(newScore);
+    for (int i = size; i > insertionPoint; i--) {

Review Comment:
   TIL: that we can use System.arraycopy for the reverse case as well.  Addressed in: 160904904c94ffd4d194fc2509124c0e2eb9c44a



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -21,32 +21,64 @@
 
 /**
  * NeighborArray encodes the neighbors of a node and their mutual scores in the HNSW graph as a pair
- * of growable arrays.
+ * of growable arrays. Nodes are arranged in the sorted order of their scores in descending order
+ * (if scoresDescOrder is true), or in the ascending order of their scores (if scoresDescOrder is
+ * false)
  *
  * @lucene.internal
  */
 public class NeighborArray {
-
+  private final boolean scoresDescOrder;
   private int size;
 
   float[] score;
   int[] node;
 
-  public NeighborArray(int maxSize) {
+  public NeighborArray(int maxSize, boolean descOrder) {
     node = new int[maxSize];
     score = new float[maxSize];
+    this.scoresDescOrder = descOrder;
   }
 
+  /**
+   * Add a new node to the NeighborArray. The new node must be worse than all previously stored
+   * nodes.
+   */
   public void add(int newNode, float newScore) {
     if (size == node.length - 1) {
       node = ArrayUtil.grow(node, (size + 1) * 3 / 2);
       score = ArrayUtil.growExact(score, node.length);
     }
+    if (size > 0) {
+      float previousScore = score[size - 1];
+      assert ((scoresDescOrder && (previousScore >= newScore))
+              || (scoresDescOrder == false && (previousScore <= newScore)))
+          : "Nodes are added in the incorrect order!";
+    }
     node[size] = newNode;
     score[size] = newScore;
     ++size;
   }
 
+  /** Add a new node to the NeighborArray into a correct sort position according to its score. */
+  public void addAndSort(int newNode, float newScore) {

Review Comment:
   Addressed in: 160904904c94ffd4d194fc2509124c0e2eb9c44a



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a diff in pull request #862: LUCENE-9848 Sort HNSW graph neighbors for construction

Posted by GitBox <gi...@apache.org>.
msokolov commented on code in PR #862:
URL: https://github.com/apache/lucene/pull/862#discussion_r865039829


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -72,8 +103,39 @@ public void removeLast() {
     size--;
   }
 
+  public void removeIndex(int idx) {
+    System.arraycopy(node, idx + 1, node, idx, size - idx);
+    System.arraycopy(score, idx + 1, score, idx, size - idx);
+    size--;
+  }
+
   @Override
   public String toString() {
     return "NeighborArray[" + size + "]";
   }
+
+  private int ascSortFindRightMostInsertionPoint(float newScore) {
+    int insertionPoint = Arrays.binarySearch(score, 0, size, newScore);
+    if (insertionPoint >= 0) {
+      // find the right most position with the same score

Review Comment:
   oh, tricky. Maybe we could search with newScore + (smallest float)? But this is fine and is probably occurring very rarely/never anyway



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org