You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/10/27 17:17:25 UTC

[GitHub] [lucene] mayya-sharipova opened a new pull request #416: LUCENE-10054 Make HnswGraph hierarchical

mayya-sharipova opened a new pull request #416:
URL: https://github.com/apache/lucene/pull/416


   Currently HNSW has only a single layer.
   This patch attempts to make multi-layered.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r783194063



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -54,13 +63,19 @@
  *   <li><b>[int32]</b> vector similarity function ordinal
  *   <li><b>[vlong]</b> offset to this field's vectors in the .vec file
  *   <li><b>[vlong]</b> length of this field's vectors, in bytes
- *   <li><b>[vlong]</b> offset to this field's index in the .vex file
+ *   <li><b>[vlong]</b> to this field's index in the .vex file

Review comment:
       Accidental removal, reinstated 5e42f220c9c35ed0ec567bd2b760c6cbe21671f4




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r770048710



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =

Review comment:
       The issue is proving tricky to resolve and personally I don't feel like it needs to block merging. We could look into improving it in a follow-up PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r751671158



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =

Review comment:
       Yes, indeed we populate `FieldEntry` from the graph index (levels) file. 
   
   > It feels a bit surprising that FieldEntry is constructed across two different files
   
   Indeed, I was not happy with this as well. Alternatively, we can put all this information about levels and layers into a single meta file as it is currently done.  Happy to hear other suggestions as well.
   
   > It also means FieldEntry isn't immutable
   
   No, `FieldEntry` is immutable, as all its fields are final, they are initialized and filled in the `Lucene90HnswVectorsReader`  constructor. 
   
   In  https://github.com/apache/lucene/pull/315 was trying to fill graph data lazily on the first use, but it looks like we don't like lazy loading.
   
   
   > Maybe we could keep the file as-is but create a new class to hold the graph index, something like GraphLevels?
   
   Is `GraphLevels` instead of `FieldEntry`?  I like the approach of a single `FieldEntry` for every field, as it follows how other field types are organized.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r758678766



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+    }
+
+    this.nodesByLevel = new ArrayList<>(numLevels);
+    nodesByLevel.add(null); // we don't need this for 0th level, as it contains all nodes
+    for (int l = 1; l < numLevels; l++) {
+      nodesByLevel.add(new int[] {0});
+    }
   }
 
   /**
-   * Searches for the nearest neighbors of a query vector.
+   * Searches HNSW graph for the nearest neighbors of a query vector.
    *
    * @param query search query vector
    * @param topK the number of nodes to be returned
-   * @param numSeed the size of the queue maintained while searching, and controls the number of
-   *     random entry points to sample
+   * @param numSeed the size of the queue maintained while searching

Review comment:
       It works for me to have a separate discussion. Maybe at least in this PR we can rename this to `numCandidates`, since the 'seed' naming no longer makes sense?
   
   As context, I still think it makes sense to remove the `numCandidates` vs. `k` distinction in `HnswGraph`. The public signature `KnnVectorsReader#search` does not include a notion of "num candidates", so users have no way to even use this distinction. I'd be in favor of removing it from `HnswGraph`, then having a follow-up discussion about whether the vector search APIs should handle `numCandidates` vs `k`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r751687763



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+    }
+
+    this.nodesByLevel = new ArrayList<>(numLevels);
+    nodesByLevel.add(null); // we don't need this for 0th level, as it contains all nodes
+    for (int l = 1; l < numLevels; l++) {
+      nodesByLevel.add(new int[] {0});
+    }
   }
 
   /**
-   * Searches for the nearest neighbors of a query vector.
+   * Searches HNSW graph for the nearest neighbors of a query vector.
    *
    * @param query search query vector
    * @param topK the number of nodes to be returned
-   * @param numSeed the size of the queue maintained while searching, and controls the number of
-   *     random entry points to sample
+   * @param numSeed the size of the queue maintained while searching

Review comment:
       @jtibshirani  Thanks for the comment, I have thought long about this, and I think we should keep it until we have a better strategy.
   
   Currently, we search HNSW graphs  through `KnnVectorQuery` providing k as `numCandidates` (`numSeed`  or `fanout`), as this query doesn't support  `numCandidates`  parameter.  I think it is not optimal that instead of returning let's say top 10 docs from every segment (or shard in a distributed search), we return 100 (200, 800 etc).
   
   Also the signature of `KnnVectorQuery` and `HnswGraph::search` need to be updated to reflect the fact that  `k` is indeed the number of candidates we are considering. 
   
   May be can discuss this outside this PR, whether and where it makes sense to have `numCandidates`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r751673717



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -99,32 +121,72 @@ public static NeighborQueue search(
       Bits acceptOrds,
       SplittableRandom random)
       throws IOException {
+
     int size = graphValues.size();
+    int boundedNumSeed = Math.max(topK, Math.min(numSeed, 2 * size));
+    NeighborQueue results;
+
+    int[] eps = new int[] {graphValues.entryNode()};
+    for (int level = graphValues.numLevels() - 1; level >= 1; level--) {
+      results =
+          HnswGraph.searchLevel(

Review comment:
       Addressed in 2961b1cdffb7df2744084712178757fe7e65257c




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova edited a comment on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova edited a comment on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953142944


   Benchmarking based on @jtibshirani [setup](based on https://github.com/jtibshirani/lucene/pull/1)
   
   baseline: main branch
   candidate: this PR
   
   **glove-25-angular**
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.626 |    10962.821 |            0.631 |      8869.807 |
   | n_cands=50   |           0.888 |     4409.952 |            0.889 |      4111.685 |
   | n_cands=100  |           0.946 |     2621.846 |            0.947 |      2734.787 |
   | n_cands=500  |           0.994 |      661.253 |            0.994 |       686.700 |
   | n_cands=800  |           0.997 |      430.172 |            0.997 |       459.356 |
   | n_cands=1000 |           0.998 |      342.915 |            0.998 |       355.238 |
   
   
   **sift-128-euclidean**
   
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.601 |     6948.736 |            0.607 |      6677.931 |
   | n_cands=50   |           0.889 |     3003.781 |            0.892 |      3202.925 |
   | n_cands=100  |           0.952 |     1622.276 |            0.953 |      1996.992 |
   | n_cands=500  |           0.996 |      444.135 |            0.996 |       540.368 |
   | n_cands=800  |           0.998 |      296.835 |            0.998 |       367.316 |
   | n_cands=1000 |           0.999 |      245.498 |            0.999 |       311.339 |
   
   
   As can be seen from the comparison, there is very slight change that the hierarchy brings: a small increase in recall by at the expense of lower QPSs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-972170715


   @jtibshirani Thanks for your extensive review. I've tried to address some comments, and some comments I think we still need to discuss
   
   > does it have an impact on indexing speed?
   
   I've run another benchmarking experiment with `KnnGraphTester` , and can be seed below there seems to be a small impact on indexing speed:  6% increase in indexing time;.
   
   #### Comparison of glove-100
   
   baseline: main branch
   candidate: this PR
   
   | Baseline                                                     | candidate                                                    |
   | ------------------------------------------------------------ | ------------------------------------------------------------ |
   | built 1180000 in 23061/2280885 ms                            | built 1180000 in 23113/2295566 ms                            |
   | flushed: segment=_1 ramUsed=474.061 MB newFlushedSize=513.284 MB docs/MB=2,305.768 | flushed: segment=_0 ramUsed=474.061 MB newFlushedSize=517.153 MB docs/MB=2,288.517 |
   
   
   
   |             | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ----------- | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10  |           0.482 |     4199.190 |            0.496 |      3982.133 |
   | n_cands=20  |           0.553 |     3563.974 |            0.561 |      3412.251 |
   | n_cands=40  |           0.631 |     2713.881 |            0.635 |      2455.465 |
   | n_cands=80  |           0.707 |     1902.787 |            0.709 |      1863.012 |
   | n_cands=120 |           0.748 |     1497.979 |            0.747 |      1400.822 |
   | n_cands=200 |           0.790 |     1031.353 |            0.791 |      1018.244 |
   | n_cands=400 |           0.840 |      577.106 |            0.841 |       600.762 |
   | n_cands=600 |           0.865 |      417.961 |            0.865 |       436.488 |
   | n_cands=800 |           0.881 |      319.900 |            0.881 |       334.503 |


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953198759


   > As can be seen from the comparison, there is very slight change that the hierarchy brings: a small increase in recall by at the expense of lower QPSs
   
   It looks like QPS is sometimes worse, but often better (like in all the sift-128-euclidean runs, num_cands >=50). I wonder if the first runs are affected by a lack of warm-up? My original set-up you linked to didn't include a warm-up, and in LUCENE-9937 we found that this can have a big impact on the first runs. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-997145470


   @jtibshirani Thanks for continuing your review.
   
   My main questions to @jpountz and @msokolov is indeed about file formats.
   
   Is it better to keep the current design of 3 files: .vem (metadata), .vec (vectors), and .vex (graphs neighbours), and extra information about graph levels into the metadata file? This make metadata file even bigger, but will allow to load all info for `FieldEntry` from a single meta file.
   
   Or is it better to break the current `metadata` file into 2 files 1) .vem - a smaller field description file and 2) a bigger .vel file about graph levels and neighbours' offsets?  This make metadata file small, but `FieldEntry` is loaded from two files.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r783342986



##########
File path: lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java
##########
@@ -116,33 +116,59 @@ public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                               ((CodecReader) ctx.reader()).getVectorReader())
                           .getFieldReader("field"))
                   .getGraphValues("field");
-          assertGraphEqual(hnsw, graphValues, nVec);
+          assertGraphEqual(hnsw, graphValues);
         }
       }
     }
   }
 
+  private void assertGraphEqual(KnnGraphValues g, KnnGraphValues h) throws IOException {
+    assertEquals("the number of levels in the graphs are different!", g.numLevels(), h.numLevels());
+    assertEquals("the number of nodes in the graphs are different!", g.size(), h.size());
+
+    // assert equal nodes on each level
+    for (int level = 0; level < g.numLevels(); level++) {
+      NodesIterator nodesOnLevel = g.getNodesOnLevel(level);
+      NodesIterator nodesOnLevel2 = h.getNodesOnLevel(level);
+      while (nodesOnLevel.hasNext() && nodesOnLevel2.hasNext()) {
+        int node = nodesOnLevel.nextInt();
+        int node2 = nodesOnLevel2.nextInt();
+        assertEquals("nodes in the graphs are different", node, node2);
+      }
+    }
+
+    // assert equal nodes' neighbours on each level
+    for (int level = 0; level < g.numLevels(); level++) {
+      NodesIterator nodesOnLevel = g.getNodesOnLevel(level);
+      while (nodesOnLevel.hasNext()) {
+        int node = nodesOnLevel.nextInt();
+        g.seek(level, node);
+        h.seek(level, node);
+        assertEquals("arcs differ for node " + node, getNeighborNodes(g), getNeighborNodes(h));
+      }
+    }
+  }
+
   // Make sure we actually approximately find the closest k elements. Mostly this is about
   // ensuring that we have all the distance functions, comparators, priority queues and so on
   // oriented in the right directions
   public void testAknnDiverse() throws IOException {
+    int maxConn = 10;
     int nDoc = 100;
-    CircularVectorValues vectors = new CircularVectorValues(nDoc);
+    TestHnswGraph.CircularVectorValues vectors = new TestHnswGraph.CircularVectorValues(nDoc);

Review comment:
       Tiny comment, don't need these qualifiers

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,75 +57,124 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+    }
+
+    this.nodesByLevel = new ArrayList<>(numLevels);
+    nodesByLevel.add(null); // we don't need this for 0th level, as it contains all nodes
+    for (int l = 1; l < numLevels; l++) {
+      nodesByLevel.add(new int[] {0});
+    }
   }
 
   /**
-   * Searches for the nearest neighbors of a query vector.
+   * Searches HNSW graph for the nearest neighbors of a query vector.
    *
    * @param query search query vector
    * @param topK the number of nodes to be returned
-   * @param numSeed the size of the queue maintained while searching, and controls the number of
-   *     random entry points to sample
    * @param vectors vector values
    * @param graphValues the graph values. May represent the entire graph, or a level in a
    *     hierarchical graph.
    * @param acceptOrds {@link Bits} that represents the allowed document ordinals to match, or
    *     {@code null} if they are all allowed to match.
-   * @param random a source of randomness, used for generating entry points to the graph
    * @return a priority queue holding the closest neighbors found
    */
   public static NeighborQueue search(
       float[] query,
       int topK,
-      int numSeed,
       RandomAccessVectorValues vectors,
       VectorSimilarityFunction similarityFunction,
       KnnGraphValues graphValues,
-      Bits acceptOrds,
-      SplittableRandom random)
+      Bits acceptOrds)
       throws IOException {
+
     int size = graphValues.size();
+    int queueSize = Math.min(topK, 2 * size);
+    NeighborQueue results;
+
+    int[] eps = new int[] {graphValues.entryNode()};
+    for (int level = graphValues.numLevels() - 1; level >= 1; level--) {
+      results = searchLevel(query, 1, level, eps, vectors, similarityFunction, graphValues, null);
+      eps[0] = results.pop();
+    }
+    results =
+        searchLevel(query, queueSize, 0, eps, vectors, similarityFunction, graphValues, acceptOrds);
+    return results;
+  }
+
+  /**
+   * Searches for the nearest neighbors of a query vector in a given level
+   *
+   * @param query search query vector
+   * @param topK the number of nearest to query results to return
+   * @param level level to search
+   * @param eps the entry points for search at this level expressed as level 0th ordinals
+   * @param vectors vector values
+   * @param similarityFunction similarity function
+   * @param graphValues the graph values
+   * @param acceptOrds {@link Bits} that represents the allowed document ordinals to match, or
+   *     {@code null} if they are all allowed to match.
+   * @return a priority queue holding the closest neighbors found
+   */
+  static NeighborQueue searchLevel(
+      float[] query,
+      int topK,
+      int level,
+      final int[] eps,
+      RandomAccessVectorValues vectors,
+      VectorSimilarityFunction similarityFunction,
+      KnnGraphValues graphValues,
+      Bits acceptOrds)
+      throws IOException {
 
+    int size = graphValues.size();
+    int queueSize = Math.max(eps.length, topK);

Review comment:
       Same question here, do we need `queueSize` or could it just be `topK`? From the paper it looks like `eps.length` will always be bounded by `topK`?

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java
##########
@@ -279,6 +303,77 @@ public void testSearch() throws Exception {
     }
   }
 
+  private void indexData(IndexWriter iw) throws IOException {
+    // Add a document for every cartesian point in an NxN square so we can
+    // easily know which are the nearest neighbors to every point. Insert by iterating
+    // using a prime number that is not a divisor of N*N so that we will hit each point once,
+    // and chosen so that points will be inserted in a deterministic
+    // but somewhat distributed pattern
+    int n = 5, stepSize = 17;
+    float[][] values = new float[n * n][];
+    int index = 0;
+    for (int i = 0; i < values.length; i++) {
+      // System.out.printf("%d: (%d, %d)\n", i, index % n, index / n);
+      int x = index % n, y = index / n;
+      values[i] = new float[] {x, y};
+      index = (index + stepSize) % (n * n);
+      add(iw, i, values[i]);
+      if (i == 13) {
+        // create 2 segments
+        iw.commit();
+      }
+    }
+    boolean forceMerge = random().nextBoolean();
+    if (forceMerge) {
+      iw.forceMerge(1);
+    }
+    assertConsistentGraph(iw, values);
+  }
+
+  public void testMultiThreadedSearch() throws Exception {
+    similarityFunction = VectorSimilarityFunction.EUCLIDEAN;
+    IndexWriterConfig config = newIndexWriterConfig();
+    config.setCodec(codec);
+    Directory dir = newDirectory();
+    IndexWriter iw = new IndexWriter(dir, config);
+    indexData(iw);
+
+    final SearcherManager manager = new SearcherManager(iw, new SearcherFactory());
+    Thread[] threads = new Thread[randomIntBetween(2, 5)];
+    final CountDownLatch latch = new CountDownLatch(1);
+    for (int i = 0; i < threads.length; i++) {
+      threads[i] =
+          new Thread(
+              () -> {
+                try {
+                  latch.await();
+                  IndexSearcher searcher = manager.acquire();
+                  try {
+                    KnnVectorQuery query = new KnnVectorQuery("vector", new float[] {0f, 0.1f}, 5);

Review comment:
       Tiny comment, should we use `assertGraphSearch` or `doKnnSearch` here like we do elsewhere in the test?

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java
##########
@@ -153,21 +161,56 @@ public void testMergeProducesSameGraph() throws Exception {
     int dimension = atLeast(10);
     float[][] values = randomVectors(numDoc, dimension);
     int mergePoint = random().nextInt(numDoc);
-    int[][] mergedGraph = getIndexedGraph(values, mergePoint, seed);
-    int[][] singleSegmentGraph = getIndexedGraph(values, -1, seed);
+    int[][][] mergedGraph = getIndexedGraph(values, mergePoint, seed);
+    int[][][] singleSegmentGraph = getIndexedGraph(values, -1, seed);
     assertGraphEquals(singleSegmentGraph, mergedGraph);
   }
 
-  private void assertGraphEquals(int[][] expected, int[][] actual) {
+  /** Test writing and reading of multiple vector fields * */
+  public void testMultipleVectorFields() throws Exception {
+    int numVectorFields = randomIntBetween(2, 5);
+    int numDoc = atLeast(100);
+    int[] dims = new int[numVectorFields];
+    float[][][] values = new float[numVectorFields][][];
+    for (int field = 0; field < numVectorFields; field++) {
+      dims[field] = atLeast(3);
+      values[field] = randomVectors(numDoc, dims[field]);
+    }
+
+    try (Directory dir = newDirectory();
+        IndexWriter iw = new IndexWriter(dir, newIndexWriterConfig(null).setCodec(codec))) {
+      for (int docID = 0; docID < numDoc; docID++) {
+        Document doc = new Document();
+        for (int field = 0; field < numVectorFields; field++) {
+          float[] vector = values[field][docID];
+          if (vector != null) {
+            FieldType fieldType = KnnVectorField.createFieldType(vector.length, similarityFunction);
+            doc.add(new KnnVectorField(KNN_GRAPH_FIELD + field, vector, fieldType));
+          }
+        }
+        String idString = Integer.toString(docID);
+        doc.add(new StringField("id", idString, Field.Store.YES));
+        iw.addDocument(doc);
+      }
+      for (int field = 0; field < numVectorFields; field++) {
+        assertConsistentGraph(iw, values[field], KNN_GRAPH_FIELD + field);
+      }
+    }
+  }
+
+  private void assertGraphEquals(int[][][] expected, int[][][] actual) {
     assertEquals("graph sizes differ", expected.length, actual.length);
-    for (int i = 0; i < expected.length; i++) {
-      assertArrayEquals("difference at ord=" + i, expected[i], actual[i]);
+    for (int level = 0; level < expected.length; level++) {
+      for (int node = 0; node < expected[level].length; node++) {
+        assertArrayEquals("difference at ord=" + node, expected[level][node], actual[level][node]);
+      }
     }
   }
 
-  private int[][] getIndexedGraph(float[][] values, int mergePoint, long seed) throws IOException {
+  private int[][][] getIndexedGraph(float[][] values, int mergePoint, long seed)

Review comment:
       Super small comment, it'd be nice if javadoc described what was returned here (it's getting to be hard to follow with three array dimensions).

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -40,10 +41,10 @@
  * <h2>Hyperparameters</h2>
  *
  * <ul>
- *   <li><code>numSeed</code> is the equivalent of <code>m</code> in the 2012 paper; it controls the
+ *   <li><code>numSeed</code> is the equivalent of <code>m</code> in the 2014 paper; it controls the

Review comment:
       Should we update this javadoc comment now that we implement the algorithm from the 2018 paper? We could remove the link to the 2014 paper and this reference to `numSeed`?

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,75 +57,124 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+    }
+
+    this.nodesByLevel = new ArrayList<>(numLevels);
+    nodesByLevel.add(null); // we don't need this for 0th level, as it contains all nodes
+    for (int l = 1; l < numLevels; l++) {
+      nodesByLevel.add(new int[] {0});
+    }
   }
 
   /**
-   * Searches for the nearest neighbors of a query vector.
+   * Searches HNSW graph for the nearest neighbors of a query vector.
    *
    * @param query search query vector
    * @param topK the number of nodes to be returned
-   * @param numSeed the size of the queue maintained while searching, and controls the number of
-   *     random entry points to sample
    * @param vectors vector values
    * @param graphValues the graph values. May represent the entire graph, or a level in a
    *     hierarchical graph.
    * @param acceptOrds {@link Bits} that represents the allowed document ordinals to match, or
    *     {@code null} if they are all allowed to match.
-   * @param random a source of randomness, used for generating entry points to the graph
    * @return a priority queue holding the closest neighbors found
    */
   public static NeighborQueue search(
       float[] query,
       int topK,
-      int numSeed,
       RandomAccessVectorValues vectors,
       VectorSimilarityFunction similarityFunction,
       KnnGraphValues graphValues,
-      Bits acceptOrds,
-      SplittableRandom random)
+      Bits acceptOrds)
       throws IOException {
+
     int size = graphValues.size();
+    int queueSize = Math.min(topK, 2 * size);

Review comment:
       Do we need a separate `queueSize` parameter or can this be `topK`? We already bound `topK` by the number of vectors in `Lucene90HsnwVectorsReader#search`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jpountz commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r782157204



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -54,13 +63,19 @@
  *   <li><b>[int32]</b> vector similarity function ordinal
  *   <li><b>[vlong]</b> offset to this field's vectors in the .vec file
  *   <li><b>[vlong]</b> length of this field's vectors, in bytes
- *   <li><b>[vlong]</b> offset to this field's index in the .vex file
+ *   <li><b>[vlong]</b> to this field's index in the .vex file

Review comment:
       why did you remove `offset`?

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -35,14 +35,23 @@
  * <p>This file stores all the floating-point vector data ordered by field, document ordinal, and
  * vector dimension. The floats are stored in little-endian byte order.
  *
- * <h2>.vex (vector index) file</h2>
+ * <h2>.vex (vector index)</h2>
  *
- * <p>Stores graphs connecting the documents for each field. For each document having a vector for a
- * given field, this is stored as:
+ * <p>Stores graphs connecting the documents for each field organized as a list of nodes' neighbours
+ * as following:
  *
  * <ul>
- *   <li><b>[int32]</b> the number of neighbor nodes
- *   <li><b>array[vint]</b> the neighbor ordinals, delta-encoded (initially subtracting -1)
+ *   <li>For each level:
+ *       <ul>
+ *         <li>For each node:
+ *             <ul>
+ *               <li><b>[int32]</b> the number of neighbor nodes
+ *               <li><b>array[int32]</b> the neighbor ordinals
+ *               <li><b>array[int32]</b> padding from empty integers if the number of neigbors less

Review comment:
       ```suggestion
    *               <li><b>array[int32]</b> padding from empty integers if the number of neighbors less
   ```

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -464,30 +477,45 @@ private void readValue(int targetOrd) throws IOException {
   /** Read the nearest-neighbors graph from the index input */
   private static final class IndexedKnnGraphReader extends KnnGraphValues {
 
-    final FieldEntry entry;
     final IndexInput dataIn;
+    final int[][] nodesByLevel;
+    final long[] graphOffsetsByLevel;
+    final int numLevels;
+    final int entryNode;
+    final int size;
+    final long bytesForConns;
 
     int arcCount;
     int arcUpTo;
     int arc;
 
     IndexedKnnGraphReader(FieldEntry entry, IndexInput dataIn) {
-      this.entry = entry;
       this.dataIn = dataIn;
+      this.nodesByLevel = entry.nodesByLevel;
+      this.numLevels = entry.numLevels;
+      this.entryNode = numLevels > 1 ? nodesByLevel[numLevels - 1][0] : 0;
+      this.size = entry.size();
+      this.graphOffsetsByLevel = entry.graphOffsetsByLevel;
+      this.bytesForConns = ((long) entry.maxConn + 1) * 4;
     }
 
     @Override
-    public void seek(int targetOrd) throws IOException {
+    public void seek(int level, int targetOrd) throws IOException {
+      int targetIndex =
+          level == 0
+              ? targetOrd
+              : Arrays.binarySearch(nodesByLevel[level], 0, nodesByLevel[level].length, targetOrd);
+      long graphDataOffset = graphOffsetsByLevel[level] + targetIndex * bytesForConns;

Review comment:
       maybe check that `targetIndex` is actually positive since binarySearch is allowed to return negative results for missing values

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -464,30 +477,45 @@ private void readValue(int targetOrd) throws IOException {
   /** Read the nearest-neighbors graph from the index input */
   private static final class IndexedKnnGraphReader extends KnnGraphValues {
 
-    final FieldEntry entry;
     final IndexInput dataIn;
+    final int[][] nodesByLevel;
+    final long[] graphOffsetsByLevel;
+    final int numLevels;
+    final int entryNode;
+    final int size;
+    final long bytesForConns;
 
     int arcCount;
     int arcUpTo;
     int arc;
 
     IndexedKnnGraphReader(FieldEntry entry, IndexInput dataIn) {
-      this.entry = entry;
       this.dataIn = dataIn;
+      this.nodesByLevel = entry.nodesByLevel;
+      this.numLevels = entry.numLevels;
+      this.entryNode = numLevels > 1 ? nodesByLevel[numLevels - 1][0] : 0;
+      this.size = entry.size();
+      this.graphOffsetsByLevel = entry.graphOffsetsByLevel;
+      this.bytesForConns = ((long) entry.maxConn + 1) * 4;

Review comment:
       Maybe use `Integer.BYTES` rathen than `4`, it would make it clearer where this 4x factor is coming from?

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -147,7 +137,6 @@ private static IndexInput openDataInput(
               + versionVectorData,
           in);
     }
-    checksumRef[0] = CodecUtil.retrieveChecksum(in);

Review comment:
       Even if we don't do anything with the checksum, we should keep calling `retrieveChecksum`, which helps make sure that the codec footer has the correct structure. See this bit from `Lucene90CompressingStoredFieldsReader` for instance that does this:
   
   ```java
         // NOTE: data file is too costly to verify checksum against all the bytes on open,
         // but for now we at least verify proper structure of the checksum footer: which looks
         // for FOOTER_MAGIC + algorithmID. This is cheap and can detect some forms of corruption
         // such as file truncation.
         CodecUtil.retrieveChecksum(fieldsStream);
   ```

##########
File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##########
@@ -32,25 +33,41 @@
   protected KnnGraphValues() {}
 
   /**
-   * Move the pointer to exactly {@code target}, the id of a node in the graph. After this method
+   * Move the pointer to exactly the given {@code level}'s {@code target}. After this method
    * returns, call {@link #nextNeighbor()} to return successive (ordered) connected node ordinals.
    *
-   * @param target must be a valid node in the graph, ie. &ge; 0 and &lt; {@link
+   * @param level level of the graph
+   * @param target ordinal of a node in the graph, must be &ge; 0 and &lt; {@link
    *     VectorValues#size()}.
    */
-  public abstract void seek(int target) throws IOException;
+  public abstract void seek(int level, int target) throws IOException;
 
   /** Returns the number of nodes in the graph */
   public abstract int size();
 
   /**
    * Iterates over the neighbor list. It is illegal to call this method after it returns
-   * NO_MORE_DOCS without calling {@link #seek(int)}, which resets the iterator.
+   * NO_MORE_DOCS without calling {@link #seek(int, int)}, which resets the iterator.
    *
    * @return a node ordinal in the graph, or NO_MORE_DOCS if the iteration is complete.
    */
   public abstract int nextNeighbor() throws IOException;
 
+  /** Returns the number of levels of the graph */
+  public abstract int numLevels() throws IOException;
+
+  /** Returns graph's entry point on the top level * */
+  public abstract int entryNode() throws IOException;
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node ordinal
+   */
+  // TODO: return a more suitable iterator over nodes than DocIdSetIterator

Review comment:
       Would `IntStream` be a good fit? Or `PrimitiveIterator.OfInt`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r740706179



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -99,32 +121,72 @@ public static NeighborQueue search(
       Bits acceptOrds,
       SplittableRandom random)

Review comment:
       I think this `SplittableRandom` is unused.

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -99,32 +121,72 @@ public static NeighborQueue search(
       Bits acceptOrds,
       SplittableRandom random)
       throws IOException {
+
     int size = graphValues.size();
+    int boundedNumSeed = Math.max(topK, Math.min(numSeed, 2 * size));
+    NeighborQueue results;
+
+    int[] eps = new int[] {graphValues.entryNode()};
+    for (int level = graphValues.numLevels() - 1; level >= 1; level--) {
+      results =
+          HnswGraph.searchLevel(

Review comment:
       Tiny comment, we can remove the `HnswGraph` qualifier.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =

Review comment:
       If I understand correctly, the graph index file is only used to populate these data structures on `FieldEntry` (the graph level information). It feels a bit surprising that `FieldEntry` is constructed across two different files (both the metadata and graph index files). It also means `FieldEntry` isn't immutable.
   
   hmm, I'm not sure what typically belongs in a metadata file. I assume it doesn't make sense to move the graph index information there, alongside the ord -> doc mapping. Maybe we could keep the file as-is but create a new class to hold the graph index, something like `GraphLevels` ?

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));

Review comment:
       I am a little confused why we are careful not to overallocate here, but in `addNode` we do `new NeighborArray(maxConn + 1)));`? I see you maintained the existing behavior, so not critical to address in this PR, I was just curious about it.

##########
File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##########
@@ -32,25 +33,41 @@
   protected KnnGraphValues() {}
 
   /**
-   * Move the pointer to exactly {@code target}, the id of a node in the graph. After this method
+   * Move the pointer to exactly the given {@code level}'s {@code target}. After this method
    * returns, call {@link #nextNeighbor()} to return successive (ordered) connected node ordinals.
    *
-   * @param target must be a valid node in the graph, ie. &ge; 0 and &lt; {@link
+   * @param level level of the graph
+   * @param target ordinal of a node in the graph, must be &ge; 0 and &lt; {@link
    *     VectorValues#size()}.
    */
-  public abstract void seek(int target) throws IOException;
+  public abstract void seek(int level, int target) throws IOException;
 
   /** Returns the number of nodes in the graph */
   public abstract int size();
 
   /**
    * Iterates over the neighbor list. It is illegal to call this method after it returns
-   * NO_MORE_DOCS without calling {@link #seek(int)}, which resets the iterator.
+   * NO_MORE_DOCS without calling {@link #seek(int, int)}, which resets the iterator.
    *
    * @return a node ordinal in the graph, or NO_MORE_DOCS if the iteration is complete.
    */
   public abstract int nextNeighbor() throws IOException;
 
+  /** Returns the number of levels of the graph */
+  public abstract int numLevels() throws IOException;
+
+  /** Returns graph's entry point on the top level * */
+  public abstract int entryNode() throws IOException;
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node ordinal
+   */
+  // TODO: return a more suitable iterator over nodes than DocIdSetIterator

Review comment:
       hmm, I guess we never addressed this TODO. I don't have a good alternative suggestion, maybe others do.

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+    }
+
+    this.nodesByLevel = new ArrayList<>(numLevels);
+    nodesByLevel.add(null); // we don't need this for 0th level, as it contains all nodes
+    for (int l = 1; l < numLevels; l++) {
+      nodesByLevel.add(new int[] {0});
+    }
   }
 
   /**
-   * Searches for the nearest neighbors of a query vector.
+   * Searches HNSW graph for the nearest neighbors of a query vector.
    *
    * @param query search query vector
    * @param topK the number of nodes to be returned
-   * @param numSeed the size of the queue maintained while searching, and controls the number of
-   *     random entry points to sample
+   * @param numSeed the size of the queue maintained while searching

Review comment:
       Could we remove this parameter `numSeed`? It seems particularly out-of-place now that we don't sample random entry points.
   
   I think it'd be a nice simplification to only have one parameter `topK`. I can't think of a case where callers would set `numSeed` to be something different from `topK`.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -69,10 +99,12 @@
 
   static final String META_CODEC_NAME = "Lucene90HnswVectorsFormatMeta";
   static final String VECTOR_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatData";
-  static final String VECTOR_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatIndex";
+  static final String GRAPH_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatGraphIndex";

Review comment:
       Tiny comment, I always have to check to remind myself what these files contain. Maybe names like "GraphLayers" and "GraphNeighbors" would be more descriptive?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-957098407


   Got it, it sounds like you already adjusted my set-up to include warm-ups. Overall it looks like a positive performance improvement. I'm in favor of merging this even though the improvement is relatively small -- I think it's good to implement the actual algorithm that we claim to! I also think this sets us up well for future performance improvements, by closely comparing to other HNSW implementations.
   
   One last thing to check regarding performance: does it have an impact on indexing speed?
   
   Reviewing the code with fresh eyes, I found some more parts where I had questions. I know 9.0 feature freeze is coming up really soon, maybe we want to discuss the timing of this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-1014658302


   Thanks everyone for the review. I am closing this PR in favour of https://github.com/apache/lucene/pull/608 that copies these changes into a new Lucene91Codec and Lucene91HnswVectorsFormat. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jpountz commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-1009970568


   Sorry @mayya-sharipova I had missed your ping. I realize we don't have good guidelines on writing file formats so I started something at https://github.com/apache/lucene/pull/598. I'll have a quick look at this PR now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r783193397



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -35,14 +35,23 @@
  * <p>This file stores all the floating-point vector data ordered by field, document ordinal, and
  * vector dimension. The floats are stored in little-endian byte order.
  *
- * <h2>.vex (vector index) file</h2>
+ * <h2>.vex (vector index)</h2>
  *
- * <p>Stores graphs connecting the documents for each field. For each document having a vector for a
- * given field, this is stored as:
+ * <p>Stores graphs connecting the documents for each field organized as a list of nodes' neighbours
+ * as following:
  *
  * <ul>
- *   <li><b>[int32]</b> the number of neighbor nodes
- *   <li><b>array[vint]</b> the neighbor ordinals, delta-encoded (initially subtracting -1)
+ *   <li>For each level:
+ *       <ul>
+ *         <li>For each node:
+ *             <ul>
+ *               <li><b>[int32]</b> the number of neighbor nodes
+ *               <li><b>array[int32]</b> the neighbor ordinals
+ *               <li><b>array[int32]</b> padding from empty integers if the number of neigbors less

Review comment:
       Addressed in 5e42f220c9c35ed0ec567bd2b760c6cbe21671f4




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r783195632



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -464,30 +477,45 @@ private void readValue(int targetOrd) throws IOException {
   /** Read the nearest-neighbors graph from the index input */
   private static final class IndexedKnnGraphReader extends KnnGraphValues {
 
-    final FieldEntry entry;
     final IndexInput dataIn;
+    final int[][] nodesByLevel;
+    final long[] graphOffsetsByLevel;
+    final int numLevels;
+    final int entryNode;
+    final int size;
+    final long bytesForConns;
 
     int arcCount;
     int arcUpTo;
     int arc;
 
     IndexedKnnGraphReader(FieldEntry entry, IndexInput dataIn) {
-      this.entry = entry;
       this.dataIn = dataIn;
+      this.nodesByLevel = entry.nodesByLevel;
+      this.numLevels = entry.numLevels;
+      this.entryNode = numLevels > 1 ? nodesByLevel[numLevels - 1][0] : 0;
+      this.size = entry.size();
+      this.graphOffsetsByLevel = entry.graphOffsetsByLevel;
+      this.bytesForConns = ((long) entry.maxConn + 1) * 4;

Review comment:
       Corrected to use `Integer.BYTES` in 5e42f220c9c35ed0ec567bd2b760c6cbe21671f4




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r767385760



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =

Review comment:
       @jtibshirani  
   
   >  It feels a bit surprising that FieldEntry is constructed across two different files (both the metadata and graph index files)
   
   As an alternative, to make FieldEntry fields only from a single `meta` file, I've created a [PR](https://github.com/apache/lucene/pull/536) that keeps the current 3 files ( .vec, .vex, and .vem),  where .vem file stores all the level's nodes; but instead of storing neighbour's offsets we calculate them, which allows to keep .vem metadata file smaller. What do you think of this approach? 
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova edited a comment on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova edited a comment on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953142944


   Benchmarking based on @jtibshirani [setup](https://github.com/jtibshirani/lucene/pull/1)
   
   baseline: main branch
   candidate: this PR
   
   **glove-25-angular**
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.626 |    10962.821 |            0.631 |      8869.807 |
   | n_cands=50   |           0.888 |     4409.952 |            0.889 |      4111.685 |
   | n_cands=100  |           0.946 |     2621.846 |            0.947 |      2734.787 |
   | n_cands=500  |           0.994 |      661.253 |            0.994 |       686.700 |
   | n_cands=800  |           0.997 |      430.172 |            0.997 |       459.356 |
   | n_cands=1000 |           0.998 |      342.915 |            0.998 |       355.238 |
   
   
   **sift-128-euclidean**
   
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.601 |     6948.736 |            0.607 |      6677.931 |
   | n_cands=50   |           0.889 |     3003.781 |            0.892 |      3202.925 |
   | n_cands=100  |           0.952 |     1622.276 |            0.953 |      1996.992 |
   | n_cands=500  |           0.996 |      444.135 |            0.996 |       540.368 |
   | n_cands=800  |           0.998 |      296.835 |            0.998 |       367.316 |
   | n_cands=1000 |           0.999 |      245.498 |            0.999 |       311.339 |
   
   
   As can be seen from the comparison, there is very slight change that the hierarchy brings: a small increase in recall by at the expense of lower QPSs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova edited a comment on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova edited a comment on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953265461


   @jtibshirani  Thanks for the comment.
   
   >  I wonder if the first runs are affected by a lack of warm-up?
   
   I've added a warmup stage as well,  but starting with bogus query args in [ann benchmarking algorithm](https://github.com/jtibshirani/ann-benchmarks/blob/lucene-hnsw/algos.yaml#L70) . 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r740706179



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -99,32 +121,72 @@ public static NeighborQueue search(
       Bits acceptOrds,
       SplittableRandom random)

Review comment:
       I think this `SplittableRandom` is unused.

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -99,32 +121,72 @@ public static NeighborQueue search(
       Bits acceptOrds,
       SplittableRandom random)
       throws IOException {
+
     int size = graphValues.size();
+    int boundedNumSeed = Math.max(topK, Math.min(numSeed, 2 * size));
+    NeighborQueue results;
+
+    int[] eps = new int[] {graphValues.entryNode()};
+    for (int level = graphValues.numLevels() - 1; level >= 1; level--) {
+      results =
+          HnswGraph.searchLevel(

Review comment:
       Tiny comment, we can remove the `HnswGraph` qualifier.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =

Review comment:
       If I understand correctly, the graph index file is only used to populate these data structures on `FieldEntry` (the graph level information). It feels a bit surprising that `FieldEntry` is constructed across two different files (both the metadata and graph index files). It also means `FieldEntry` isn't immutable.
   
   hmm, I'm not sure what typically belongs in a metadata file. I assume it doesn't make sense to move the graph index information there, alongside the ord -> doc mapping. Maybe we could keep the file as-is but create a new class to hold the graph index, something like `GraphLevels` ?

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));

Review comment:
       I am a little confused why we are careful not to overallocate here, but in `addNode` we do `new NeighborArray(maxConn + 1)));`? I see you maintained the existing behavior, so not critical to address in this PR, I was just curious about it.

##########
File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##########
@@ -32,25 +33,41 @@
   protected KnnGraphValues() {}
 
   /**
-   * Move the pointer to exactly {@code target}, the id of a node in the graph. After this method
+   * Move the pointer to exactly the given {@code level}'s {@code target}. After this method
    * returns, call {@link #nextNeighbor()} to return successive (ordered) connected node ordinals.
    *
-   * @param target must be a valid node in the graph, ie. &ge; 0 and &lt; {@link
+   * @param level level of the graph
+   * @param target ordinal of a node in the graph, must be &ge; 0 and &lt; {@link
    *     VectorValues#size()}.
    */
-  public abstract void seek(int target) throws IOException;
+  public abstract void seek(int level, int target) throws IOException;
 
   /** Returns the number of nodes in the graph */
   public abstract int size();
 
   /**
    * Iterates over the neighbor list. It is illegal to call this method after it returns
-   * NO_MORE_DOCS without calling {@link #seek(int)}, which resets the iterator.
+   * NO_MORE_DOCS without calling {@link #seek(int, int)}, which resets the iterator.
    *
    * @return a node ordinal in the graph, or NO_MORE_DOCS if the iteration is complete.
    */
   public abstract int nextNeighbor() throws IOException;
 
+  /** Returns the number of levels of the graph */
+  public abstract int numLevels() throws IOException;
+
+  /** Returns graph's entry point on the top level * */
+  public abstract int entryNode() throws IOException;
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node ordinal
+   */
+  // TODO: return a more suitable iterator over nodes than DocIdSetIterator

Review comment:
       hmm, I guess we never addressed this TODO. I don't have a good alternative suggestion, maybe others do.

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+    }
+
+    this.nodesByLevel = new ArrayList<>(numLevels);
+    nodesByLevel.add(null); // we don't need this for 0th level, as it contains all nodes
+    for (int l = 1; l < numLevels; l++) {
+      nodesByLevel.add(new int[] {0});
+    }
   }
 
   /**
-   * Searches for the nearest neighbors of a query vector.
+   * Searches HNSW graph for the nearest neighbors of a query vector.
    *
    * @param query search query vector
    * @param topK the number of nodes to be returned
-   * @param numSeed the size of the queue maintained while searching, and controls the number of
-   *     random entry points to sample
+   * @param numSeed the size of the queue maintained while searching

Review comment:
       Could we remove this parameter `numSeed`? It seems particularly out-of-place now that we don't sample random entry points.
   
   I think it'd be a nice simplification to only have one parameter `topK`. I can't think of a case where callers would set `numSeed` to be something different from `topK`.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -69,10 +99,12 @@
 
   static final String META_CODEC_NAME = "Lucene90HnswVectorsFormatMeta";
   static final String VECTOR_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatData";
-  static final String VECTOR_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatIndex";
+  static final String GRAPH_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatGraphIndex";

Review comment:
       Tiny comment, I always have to check to remind myself what these files contain. Maybe names like "GraphLayers" and "GraphNeighbors" would be more descriptive?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r784284757



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -40,10 +41,10 @@
  * <h2>Hyperparameters</h2>
  *
  * <ul>
- *   <li><code>numSeed</code> is the equivalent of <code>m</code> in the 2012 paper; it controls the
+ *   <li><code>numSeed</code> is the equivalent of <code>m</code> in the 2014 paper; it controls the

Review comment:
       Good notice! Addressed in 7bce9f15daa225243c034c22ca9cb3d581e09fc5




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova edited a comment on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova edited a comment on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953142944


   Benchmarking based on @jtibshirani [setup](https://github.com/jtibshirani/lucene/pull/1)
   
   baseline: main branch
   candidate: this PR
   
   **glove-25-angular**
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.626 |    10962.821 |            0.631 |      8869.807 |
   | n_cands=50   |           0.888 |     4409.952 |            0.889 |      4111.685 |
   | n_cands=100  |           0.946 |     2621.846 |            0.947 |      2734.787 |
   | n_cands=500  |           0.994 |      661.253 |            0.994 |       686.700 |
   | n_cands=800  |           0.997 |      430.172 |            0.997 |       459.356 |
   | n_cands=1000 |           0.998 |      342.915 |            0.998 |       355.238 |
   
   
   **glove-200-angular**
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.285 |     4843.028 |            0.312 |      5208.453 |
   | n_cands=50   |           0.556 |     2119.933 |            0.558 |      2250.213 |
   | n_cands=100  |           0.655 |     1399.261 |            0.648 |      1454.996 |
   | n_cands=500  |           0.806 |      379.745 |            0.806 |       410.553 |
   | n_cands=800  |           0.836 |      252.796 |            0.836 |       276.456 |
   | n_cands=1000 |           0.849 |      201.012 |            0.849 |       220.739 |
   
   
   
   
   **sift-128-euclidean**
   
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.601 |     6948.736 |            0.607 |      6677.931 |
   | n_cands=50   |           0.889 |     3003.781 |            0.892 |      3202.925 |
   | n_cands=100  |           0.952 |     1622.276 |            0.953 |      1996.992 |
   | n_cands=500  |           0.996 |      444.135 |            0.996 |       540.368 |
   | n_cands=800  |           0.998 |      296.835 |            0.998 |       367.316 |
   | n_cands=1000 |           0.999 |      245.498 |            0.999 |       311.339 |
   
   
   As can be seen from the comparison, there is very slight change that the hierarchy brings: a small increase in recall by at the expense of lower QPSs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r740706179



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -99,32 +121,72 @@ public static NeighborQueue search(
       Bits acceptOrds,
       SplittableRandom random)

Review comment:
       I think this `SplittableRandom` is unused.

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -99,32 +121,72 @@ public static NeighborQueue search(
       Bits acceptOrds,
       SplittableRandom random)
       throws IOException {
+
     int size = graphValues.size();
+    int boundedNumSeed = Math.max(topK, Math.min(numSeed, 2 * size));
+    NeighborQueue results;
+
+    int[] eps = new int[] {graphValues.entryNode()};
+    for (int level = graphValues.numLevels() - 1; level >= 1; level--) {
+      results =
+          HnswGraph.searchLevel(

Review comment:
       Tiny comment, we can remove the `HnswGraph` qualifier.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =

Review comment:
       If I understand correctly, the graph index file is only used to populate these data structures on `FieldEntry` (the graph level information). It feels a bit surprising that `FieldEntry` is constructed across two different files (both the metadata and graph index files). It also means `FieldEntry` isn't immutable.
   
   hmm, I'm not sure what typically belongs in a metadata file. I assume it doesn't make sense to move the graph index information there, alongside the ord -> doc mapping. Maybe we could keep the file as-is but create a new class to hold the graph index, something like `GraphLevels` ?

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));

Review comment:
       I am a little confused why we are careful not to overallocate here, but in `addNode` we do `new NeighborArray(maxConn + 1)));`? I see you maintained the existing behavior, so not critical to address in this PR, I was just curious about it.

##########
File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##########
@@ -32,25 +33,41 @@
   protected KnnGraphValues() {}
 
   /**
-   * Move the pointer to exactly {@code target}, the id of a node in the graph. After this method
+   * Move the pointer to exactly the given {@code level}'s {@code target}. After this method
    * returns, call {@link #nextNeighbor()} to return successive (ordered) connected node ordinals.
    *
-   * @param target must be a valid node in the graph, ie. &ge; 0 and &lt; {@link
+   * @param level level of the graph
+   * @param target ordinal of a node in the graph, must be &ge; 0 and &lt; {@link
    *     VectorValues#size()}.
    */
-  public abstract void seek(int target) throws IOException;
+  public abstract void seek(int level, int target) throws IOException;
 
   /** Returns the number of nodes in the graph */
   public abstract int size();
 
   /**
    * Iterates over the neighbor list. It is illegal to call this method after it returns
-   * NO_MORE_DOCS without calling {@link #seek(int)}, which resets the iterator.
+   * NO_MORE_DOCS without calling {@link #seek(int, int)}, which resets the iterator.
    *
    * @return a node ordinal in the graph, or NO_MORE_DOCS if the iteration is complete.
    */
   public abstract int nextNeighbor() throws IOException;
 
+  /** Returns the number of levels of the graph */
+  public abstract int numLevels() throws IOException;
+
+  /** Returns graph's entry point on the top level * */
+  public abstract int entryNode() throws IOException;
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node ordinal
+   */
+  // TODO: return a more suitable iterator over nodes than DocIdSetIterator

Review comment:
       hmm, I guess we never addressed this TODO. I don't have a good alternative suggestion, maybe others do.

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+    }
+
+    this.nodesByLevel = new ArrayList<>(numLevels);
+    nodesByLevel.add(null); // we don't need this for 0th level, as it contains all nodes
+    for (int l = 1; l < numLevels; l++) {
+      nodesByLevel.add(new int[] {0});
+    }
   }
 
   /**
-   * Searches for the nearest neighbors of a query vector.
+   * Searches HNSW graph for the nearest neighbors of a query vector.
    *
    * @param query search query vector
    * @param topK the number of nodes to be returned
-   * @param numSeed the size of the queue maintained while searching, and controls the number of
-   *     random entry points to sample
+   * @param numSeed the size of the queue maintained while searching

Review comment:
       Could we remove this parameter `numSeed`? It seems particularly out-of-place now that we don't sample random entry points.
   
   I think it'd be a nice simplification to only have one parameter `topK`. I can't think of a case where callers would set `numSeed` to be something different from `topK`.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -69,10 +99,12 @@
 
   static final String META_CODEC_NAME = "Lucene90HnswVectorsFormatMeta";
   static final String VECTOR_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatData";
-  static final String VECTOR_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatIndex";
+  static final String GRAPH_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatGraphIndex";

Review comment:
       Tiny comment, I always have to check to remind myself what these files contain. Maybe names like "GraphLayers" and "GraphNeighbors" would be more descriptive?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova edited a comment on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova edited a comment on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953265461


   @jtibshirani  Thanks for the comment.
   
   >  I wonder if the first runs are affected by a lack of warm-up?
   
   In my benchmarks I had a warmup stage as well starting with bogus query args in [ann benchmarking algorithm](https://github.com/jtibshirani/ann-benchmarks/blob/lucene-hnsw/algos.yaml#L70) . 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r762700036



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+    }
+
+    this.nodesByLevel = new ArrayList<>(numLevels);
+    nodesByLevel.add(null); // we don't need this for 0th level, as it contains all nodes
+    for (int l = 1; l < numLevels; l++) {
+      nodesByLevel.add(new int[] {0});
+    }
   }
 
   /**
-   * Searches for the nearest neighbors of a query vector.
+   * Searches HNSW graph for the nearest neighbors of a query vector.
    *
    * @param query search query vector
    * @param topK the number of nodes to be returned
-   * @param numSeed the size of the queue maintained while searching, and controls the number of
-   *     random entry points to sample
+   * @param numSeed the size of the queue maintained while searching

Review comment:
       @jtibshirani Thanks, for now removing `numSeed` with a possible re-discussion makes sense to me. Addressed in b421dcebc8dc2f4d8700ab8241fd0af013ab338e




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova edited a comment on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova edited a comment on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-972170715


   @jtibshirani Thanks for your extensive review. I've tried to address some comments, and some comments I think we still need to discuss
   
   > does it have an impact on indexing speed?
   
   I've run another benchmarking experiment with `KnnGraphTester` , and as can be seen below there seems to be a small impact on indexing speed:  6% increase in indexing time;.
   
   #### Comparison of glove-100
   
   baseline: main branch
   candidate: this PR
   
   | Baseline                                                     | candidate                                                    |
   | ------------------------------------------------------------ | ------------------------------------------------------------ |
   | built 1180000 in 23061/2280885 ms                            | built 1180000 in 23113/2295566 ms                            |
   | flushed: segment=_1 ramUsed=474.061 MB newFlushedSize=513.284 MB docs/MB=2,305.768 | flushed: segment=_0 ramUsed=474.061 MB newFlushedSize=517.153 MB docs/MB=2,288.517 |
   
   
   
   |             | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ----------- | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10  |           0.482 |     4199.190 |            0.496 |      3982.133 |
   | n_cands=20  |           0.553 |     3563.974 |            0.561 |      3412.251 |
   | n_cands=40  |           0.631 |     2713.881 |            0.635 |      2455.465 |
   | n_cands=80  |           0.707 |     1902.787 |            0.709 |      1863.012 |
   | n_cands=120 |           0.748 |     1497.979 |            0.747 |      1400.822 |
   | n_cands=200 |           0.790 |     1031.353 |            0.791 |      1018.244 |
   | n_cands=400 |           0.840 |      577.106 |            0.841 |       600.762 |
   | n_cands=600 |           0.865 |      417.961 |            0.865 |       436.488 |
   | n_cands=800 |           0.881 |      319.900 |            0.881 |       334.503 |


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r754297662



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =

Review comment:
       @jpountz I wonder if Adrien can suggest us better how to organize vector files?  
   
   For the context: currently, `.vem (vector metadata)` can be quite big as it contains graph offsets (for 1M docs: 1M * 8 bytes = 8Mb; for 10M docs: 80 Mb). This PR tries to store graph offsets into a separate file, but still load them during initialization of `FieldEntry`. The problem then as @jtibshirani noticed `FieldEntry` data is constructed from two files: the metadata file and  the file that stores graph offsets.  May be it is worth to keep the original format and store all this data in the metadata file?

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =

Review comment:
       @jpountz I wonder if you can suggest us better how to organize vector files?  
   
   For the context: currently, `.vem (vector metadata)` can be quite big as it contains graph offsets (for 1M docs: 1M * 8 bytes = 8Mb; for 10M docs: 80 Mb). This PR tries to store graph offsets into a separate file, but still load them during initialization of `FieldEntry`. The problem then as @jtibshirani noticed `FieldEntry` data is constructed from two files: the metadata file and  the file that stores graph offsets.  May be it is worth to keep the original format and store all this data in the metadata file?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953265461


   @jtibshirani  Thanks for the comment.
   
   >  I wonder if the first runs are affected by a lack of warm-up?
   
   I've added a warmup stage as well,  but starting with bogus query args in ann benchmarking algorithm. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-953142944


   Benchmarking based on @jtibshirani [setup](based on https://github.com/jtibshirani/lucene/pull/1)
   
   baseline: main branch
   candidate: this PR
   
   **glove-25-angular**
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.626 |    10962.821 |            0.631 |      8869.807 |
   | n_cands=50   |           0.888 |     4409.952 |            0.889 |      4111.685 |
   | n_cands=100  |           0.946 |     2621.846 |            0.947 |      2734.787 |
   | n_cands=500  |           0.994 |      661.253 |            0.994 |       686.700 |
   | n_cands=800  |           0.997 |      430.172 |            0.997 |       459.356 |
   | n_cands=1000 |           0.998 |      342.915 |            0.998 |       355.238 |
   
   
   **sift-128-euclidean**
   
   |              | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ------------ | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10   |           0.601 |     6948.736 |            0.607 |      6677.931 |
   | n_cands=50   |           0.889 |     3003.781 |            0.892 |      3202.925 |
   | n_cands=100  |           0.952 |     1622.276 |            0.953 |      1996.992 |
   | n_cands=500  |           0.996 |      444.135 |            0.996 |       540.368 |
   | n_cands=800  |           0.998 |      296.835 |            0.998 |       367.316 |
   | n_cands=1000 |           0.999 |      245.498 |            0.999 |       311.339 |
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r783194796



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -147,7 +137,6 @@ private static IndexInput openDataInput(
               + versionVectorData,
           in);
     }
-    checksumRef[0] = CodecUtil.retrieveChecksum(in);

Review comment:
       Thanks for the detailed explanation. It completely makes sense. I've reinstated the retrieving of a checksum in 5e42f220c9c35ed0ec567bd2b760c6cbe21671f4




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r784284891



##########
File path: lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java
##########
@@ -116,33 +116,59 @@ public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                               ((CodecReader) ctx.reader()).getVectorReader())
                           .getFieldReader("field"))
                   .getGraphValues("field");
-          assertGraphEqual(hnsw, graphValues, nVec);
+          assertGraphEqual(hnsw, graphValues);
         }
       }
     }
   }
 
+  private void assertGraphEqual(KnnGraphValues g, KnnGraphValues h) throws IOException {
+    assertEquals("the number of levels in the graphs are different!", g.numLevels(), h.numLevels());
+    assertEquals("the number of nodes in the graphs are different!", g.size(), h.size());
+
+    // assert equal nodes on each level
+    for (int level = 0; level < g.numLevels(); level++) {
+      NodesIterator nodesOnLevel = g.getNodesOnLevel(level);
+      NodesIterator nodesOnLevel2 = h.getNodesOnLevel(level);
+      while (nodesOnLevel.hasNext() && nodesOnLevel2.hasNext()) {
+        int node = nodesOnLevel.nextInt();
+        int node2 = nodesOnLevel2.nextInt();
+        assertEquals("nodes in the graphs are different", node, node2);
+      }
+    }
+
+    // assert equal nodes' neighbours on each level
+    for (int level = 0; level < g.numLevels(); level++) {
+      NodesIterator nodesOnLevel = g.getNodesOnLevel(level);
+      while (nodesOnLevel.hasNext()) {
+        int node = nodesOnLevel.nextInt();
+        g.seek(level, node);
+        h.seek(level, node);
+        assertEquals("arcs differ for node " + node, getNeighborNodes(g), getNeighborNodes(h));
+      }
+    }
+  }
+
   // Make sure we actually approximately find the closest k elements. Mostly this is about
   // ensuring that we have all the distance functions, comparators, priority queues and so on
   // oriented in the right directions
   public void testAknnDiverse() throws IOException {
+    int maxConn = 10;
     int nDoc = 100;
-    CircularVectorValues vectors = new CircularVectorValues(nDoc);
+    TestHnswGraph.CircularVectorValues vectors = new TestHnswGraph.CircularVectorValues(nDoc);

Review comment:
       Addressed in 7bce9f15daa225243c034c22ca9cb3d581e09fc5




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r769177664



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =

Review comment:
       I like this idea. How would you like to proceed -- work on that PR first (since it seems useful on its own), or move forward with this one and follow-up with a fix soon after?
   
   To clarify, I was not thinking that `GraphLevels` would replace `FieldEntry`. It would be a second data structure.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

msokolov commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-997209709


   @mayya-sharipova I don't have strong feelings about the file structure; either way seems OK to me. I guess if pressed I would probably choose to keep a smaller number of files since it will reduce the amount of managing/opening/closing files, and the directory will be less cluttered (most of the formats have one or two files?). I don't see any particular downside to having different kinds of information in a file, although then I guess the file parsing code can be a little more complicated, so it's a tradeoff without a clear deciding factor for me?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r751672547



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));

Review comment:
       I am curios as well; this was indeed carried over from the existing code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-957098407


   Got it, it sounds like you already adjusted my set-up to include warm-ups. Overall it looks like a positive performance improvement. I'm in favor of merging this even though the improvement is relatively small -- I think it's good to implement the actual algorithm that we claim to! I also think this sets us up well for future performance improvements, by closely comparing to other HNSW implementations.
   
   One last thing to check regarding performance: does it have an impact on indexing speed?
   
   Reviewing the code with fresh eyes, I found some more parts where I had questions. I know 9.0 feature freeze is coming up really soon, maybe we want to discuss the timing of this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

msokolov commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r752477271



##########
File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##########
@@ -32,25 +33,41 @@
   protected KnnGraphValues() {}
 
   /**
-   * Move the pointer to exactly {@code target}, the id of a node in the graph. After this method
+   * Move the pointer to exactly the given {@code level}'s {@code target}. After this method
    * returns, call {@link #nextNeighbor()} to return successive (ordered) connected node ordinals.
    *
-   * @param target must be a valid node in the graph, ie. &ge; 0 and &lt; {@link
+   * @param level level of the graph
+   * @param target ordinal of a node in the graph, must be &ge; 0 and &lt; {@link
    *     VectorValues#size()}.
    */
-  public abstract void seek(int target) throws IOException;
+  public abstract void seek(int level, int target) throws IOException;
 
   /** Returns the number of nodes in the graph */
   public abstract int size();
 
   /**
    * Iterates over the neighbor list. It is illegal to call this method after it returns
-   * NO_MORE_DOCS without calling {@link #seek(int)}, which resets the iterator.
+   * NO_MORE_DOCS without calling {@link #seek(int, int)}, which resets the iterator.
    *
    * @return a node ordinal in the graph, or NO_MORE_DOCS if the iteration is complete.
    */
   public abstract int nextNeighbor() throws IOException;
 
+  /** Returns the number of levels of the graph */
+  public abstract int numLevels() throws IOException;
+
+  /** Returns graph's entry point on the top level * */
+  public abstract int entryNode() throws IOException;
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node ordinal
+   */
+  // TODO: return a more suitable iterator over nodes than DocIdSetIterator

Review comment:
       I guess we could make one up




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r784284582



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,75 +57,124 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+    }
+
+    this.nodesByLevel = new ArrayList<>(numLevels);
+    nodesByLevel.add(null); // we don't need this for 0th level, as it contains all nodes
+    for (int l = 1; l < numLevels; l++) {
+      nodesByLevel.add(new int[] {0});
+    }
   }
 
   /**
-   * Searches for the nearest neighbors of a query vector.
+   * Searches HNSW graph for the nearest neighbors of a query vector.
    *
    * @param query search query vector
    * @param topK the number of nodes to be returned
-   * @param numSeed the size of the queue maintained while searching, and controls the number of
-   *     random entry points to sample
    * @param vectors vector values
    * @param graphValues the graph values. May represent the entire graph, or a level in a
    *     hierarchical graph.
    * @param acceptOrds {@link Bits} that represents the allowed document ordinals to match, or
    *     {@code null} if they are all allowed to match.
-   * @param random a source of randomness, used for generating entry points to the graph
    * @return a priority queue holding the closest neighbors found
    */
   public static NeighborQueue search(
       float[] query,
       int topK,
-      int numSeed,
       RandomAccessVectorValues vectors,
       VectorSimilarityFunction similarityFunction,
       KnnGraphValues graphValues,
-      Bits acceptOrds,
-      SplittableRandom random)
+      Bits acceptOrds)
       throws IOException {
+
     int size = graphValues.size();
+    int queueSize = Math.min(topK, 2 * size);

Review comment:
       Great comment, indeed it is enough just to use `topK`. Addressed in 7bce9f15daa225243c034c22ca9cb3d581e09fc5




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-995224656


   @mayya-sharipova thanks for all your work on this PR. For me, there aren't blockers to merging. I'll do a final review tomorrow (with my morning coffee :)) to make sure I didn't miss anything. What are the next steps -- are you also hoping for a review from Adrien or Mike, who are experienced at designing file formats? Any open questions for them?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

msokolov commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r752476787



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));

Review comment:
       heh, this makes no sense at all! Git-spelunking, it looks like I added this fancy dancing when adding the diversity criteria, but overlooked `addNode` - which does the majority of the allocation! It could be worth taking a lok to see if we can cheaply save some RAM




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r751671791



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -69,10 +99,12 @@
 
   static final String META_CODEC_NAME = "Lucene90HnswVectorsFormatMeta";
   static final String VECTOR_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatData";
-  static final String VECTOR_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatIndex";
+  static final String GRAPH_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatGraphIndex";

Review comment:
       Great suggestion, addressed in 2961b1c




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r769229827



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =

Review comment:
       > How would you like to proceed -- work on that PR first (since it seems useful on its own), or move forward with this one and follow-up with a fix soon after?
   
   I was under the impression that we are not happy with the current state of this PR and would not want to merge it without some changes. No?
   
   > To clarify, I was not thinking that GraphLevels would replace FieldEntry. It would be a second data structure.
   
   Can you please elaborate more how to do you see it is organized? How `GraphLevels` connected to a `FieldEntry`?  Do you suggest put GraphLevels into a separate file and load them on a first use? 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-957098407


   Got it, it sounds like you already adjusted my set-up to include warm-ups. Overall it looks like a positive performance improvement. I'm in favor of merging this even though the improvement is relatively small -- I think it's good to implement the actual algorithm that we claim to! I also think this sets us up well for future performance improvements, by closely comparing to other HNSW implementations.
   
   One last thing to check regarding performance: does it have an impact on indexing speed?
   
   Reviewing the code with fresh eyes, I found some more parts where I had questions. I know 9.0 feature freeze is coming up really soon, maybe we want to discuss the timing of this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r783195370



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -464,30 +477,45 @@ private void readValue(int targetOrd) throws IOException {
   /** Read the nearest-neighbors graph from the index input */
   private static final class IndexedKnnGraphReader extends KnnGraphValues {
 
-    final FieldEntry entry;
     final IndexInput dataIn;
+    final int[][] nodesByLevel;
+    final long[] graphOffsetsByLevel;
+    final int numLevels;
+    final int entryNode;
+    final int size;
+    final long bytesForConns;
 
     int arcCount;
     int arcUpTo;
     int arc;
 
     IndexedKnnGraphReader(FieldEntry entry, IndexInput dataIn) {
-      this.entry = entry;
       this.dataIn = dataIn;
+      this.nodesByLevel = entry.nodesByLevel;
+      this.numLevels = entry.numLevels;
+      this.entryNode = numLevels > 1 ? nodesByLevel[numLevels - 1][0] : 0;
+      this.size = entry.size();
+      this.graphOffsetsByLevel = entry.graphOffsetsByLevel;
+      this.bytesForConns = ((long) entry.maxConn + 1) * 4;
     }
 
     @Override
-    public void seek(int targetOrd) throws IOException {
+    public void seek(int level, int targetOrd) throws IOException {
+      int targetIndex =
+          level == 0
+              ? targetOrd
+              : Arrays.binarySearch(nodesByLevel[level], 0, nodesByLevel[level].length, targetOrd);
+      long graphDataOffset = graphOffsetsByLevel[level] + targetIndex * bytesForConns;

Review comment:
       I've added the assertion for this check in 5e42f220c9c35ed0ec567bd2b760c6cbe21671f4




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r784284382



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,75 +57,124 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+    }
+
+    this.nodesByLevel = new ArrayList<>(numLevels);
+    nodesByLevel.add(null); // we don't need this for 0th level, as it contains all nodes
+    for (int l = 1; l < numLevels; l++) {
+      nodesByLevel.add(new int[] {0});
+    }
   }
 
   /**
-   * Searches for the nearest neighbors of a query vector.
+   * Searches HNSW graph for the nearest neighbors of a query vector.
    *
    * @param query search query vector
    * @param topK the number of nodes to be returned
-   * @param numSeed the size of the queue maintained while searching, and controls the number of
-   *     random entry points to sample
    * @param vectors vector values
    * @param graphValues the graph values. May represent the entire graph, or a level in a
    *     hierarchical graph.
    * @param acceptOrds {@link Bits} that represents the allowed document ordinals to match, or
    *     {@code null} if they are all allowed to match.
-   * @param random a source of randomness, used for generating entry points to the graph
    * @return a priority queue holding the closest neighbors found
    */
   public static NeighborQueue search(
       float[] query,
       int topK,
-      int numSeed,
       RandomAccessVectorValues vectors,
       VectorSimilarityFunction similarityFunction,
       KnnGraphValues graphValues,
-      Bits acceptOrds,
-      SplittableRandom random)
+      Bits acceptOrds)
       throws IOException {
+
     int size = graphValues.size();
+    int queueSize = Math.min(topK, 2 * size);
+    NeighborQueue results;
+
+    int[] eps = new int[] {graphValues.entryNode()};
+    for (int level = graphValues.numLevels() - 1; level >= 1; level--) {
+      results = searchLevel(query, 1, level, eps, vectors, similarityFunction, graphValues, null);
+      eps[0] = results.pop();
+    }
+    results =
+        searchLevel(query, queueSize, 0, eps, vectors, similarityFunction, graphValues, acceptOrds);
+    return results;
+  }
+
+  /**
+   * Searches for the nearest neighbors of a query vector in a given level
+   *
+   * @param query search query vector
+   * @param topK the number of nearest to query results to return
+   * @param level level to search
+   * @param eps the entry points for search at this level expressed as level 0th ordinals
+   * @param vectors vector values
+   * @param similarityFunction similarity function
+   * @param graphValues the graph values
+   * @param acceptOrds {@link Bits} that represents the allowed document ordinals to match, or
+   *     {@code null} if they are all allowed to match.
+   * @return a priority queue holding the closest neighbors found
+   */
+  static NeighborQueue searchLevel(
+      float[] query,
+      int topK,
+      int level,
+      final int[] eps,
+      RandomAccessVectorValues vectors,
+      VectorSimilarityFunction similarityFunction,
+      KnnGraphValues graphValues,
+      Bits acceptOrds)
+      throws IOException {
 
+    int size = graphValues.size();
+    int queueSize = Math.max(eps.length, topK);

Review comment:
       Great comment, indeed we can use use `topK`. Addressed in 7bce9f15daa225243c034c22ca9cb3d581e09fc5




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r784285235



##########
File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java
##########
@@ -153,21 +161,56 @@ public void testMergeProducesSameGraph() throws Exception {
     int dimension = atLeast(10);
     float[][] values = randomVectors(numDoc, dimension);
     int mergePoint = random().nextInt(numDoc);
-    int[][] mergedGraph = getIndexedGraph(values, mergePoint, seed);
-    int[][] singleSegmentGraph = getIndexedGraph(values, -1, seed);
+    int[][][] mergedGraph = getIndexedGraph(values, mergePoint, seed);
+    int[][][] singleSegmentGraph = getIndexedGraph(values, -1, seed);
     assertGraphEquals(singleSegmentGraph, mergedGraph);
   }
 
-  private void assertGraphEquals(int[][] expected, int[][] actual) {
+  /** Test writing and reading of multiple vector fields * */
+  public void testMultipleVectorFields() throws Exception {
+    int numVectorFields = randomIntBetween(2, 5);
+    int numDoc = atLeast(100);
+    int[] dims = new int[numVectorFields];
+    float[][][] values = new float[numVectorFields][][];
+    for (int field = 0; field < numVectorFields; field++) {
+      dims[field] = atLeast(3);
+      values[field] = randomVectors(numDoc, dims[field]);
+    }
+
+    try (Directory dir = newDirectory();
+        IndexWriter iw = new IndexWriter(dir, newIndexWriterConfig(null).setCodec(codec))) {
+      for (int docID = 0; docID < numDoc; docID++) {
+        Document doc = new Document();
+        for (int field = 0; field < numVectorFields; field++) {
+          float[] vector = values[field][docID];
+          if (vector != null) {
+            FieldType fieldType = KnnVectorField.createFieldType(vector.length, similarityFunction);
+            doc.add(new KnnVectorField(KNN_GRAPH_FIELD + field, vector, fieldType));
+          }
+        }
+        String idString = Integer.toString(docID);
+        doc.add(new StringField("id", idString, Field.Store.YES));
+        iw.addDocument(doc);
+      }
+      for (int field = 0; field < numVectorFields; field++) {
+        assertConsistentGraph(iw, values[field], KNN_GRAPH_FIELD + field);
+      }
+    }
+  }
+
+  private void assertGraphEquals(int[][][] expected, int[][][] actual) {
     assertEquals("graph sizes differ", expected.length, actual.length);
-    for (int i = 0; i < expected.length; i++) {
-      assertArrayEquals("difference at ord=" + i, expected[i], actual[i]);
+    for (int level = 0; level < expected.length; level++) {
+      for (int node = 0; node < expected[level].length; node++) {
+        assertArrayEquals("difference at ord=" + node, expected[level][node], actual[level][node]);
+      }
     }
   }
 
-  private int[][] getIndexedGraph(float[][] values, int mergePoint, long seed) throws IOException {
+  private int[][][] getIndexedGraph(float[][] values, int mergePoint, long seed)

Review comment:
       Javadoc added in 7bce9f15daa225243c034c22ca9cb3d581e09fc5




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova closed pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova closed pull request #416:
URL: https://github.com/apache/lucene/pull/416


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r783592663



##########
File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java
##########
@@ -279,6 +303,77 @@ public void testSearch() throws Exception {
     }
   }
 
+  private void indexData(IndexWriter iw) throws IOException {
+    // Add a document for every cartesian point in an NxN square so we can
+    // easily know which are the nearest neighbors to every point. Insert by iterating
+    // using a prime number that is not a divisor of N*N so that we will hit each point once,
+    // and chosen so that points will be inserted in a deterministic
+    // but somewhat distributed pattern
+    int n = 5, stepSize = 17;
+    float[][] values = new float[n * n][];
+    int index = 0;
+    for (int i = 0; i < values.length; i++) {
+      // System.out.printf("%d: (%d, %d)\n", i, index % n, index / n);
+      int x = index % n, y = index / n;
+      values[i] = new float[] {x, y};
+      index = (index + stepSize) % (n * n);
+      add(iw, i, values[i]);
+      if (i == 13) {
+        // create 2 segments
+        iw.commit();
+      }
+    }
+    boolean forceMerge = random().nextBoolean();
+    if (forceMerge) {
+      iw.forceMerge(1);
+    }
+    assertConsistentGraph(iw, values);
+  }
+
+  public void testMultiThreadedSearch() throws Exception {
+    similarityFunction = VectorSimilarityFunction.EUCLIDEAN;
+    IndexWriterConfig config = newIndexWriterConfig();
+    config.setCodec(codec);
+    Directory dir = newDirectory();
+    IndexWriter iw = new IndexWriter(dir, config);
+    indexData(iw);
+
+    final SearcherManager manager = new SearcherManager(iw, new SearcherFactory());
+    Thread[] threads = new Thread[randomIntBetween(2, 5)];
+    final CountDownLatch latch = new CountDownLatch(1);
+    for (int i = 0; i < threads.length; i++) {
+      threads[i] =
+          new Thread(
+              () -> {
+                try {
+                  latch.await();
+                  IndexSearcher searcher = manager.acquire();
+                  try {
+                    KnnVectorQuery query = new KnnVectorQuery("vector", new float[] {0f, 0.1f}, 5);

Review comment:
       `assertGraphSearch` expects an `IndexReader` and user reader fo search, while in this test we have `SearcherManager` that uses `IndexSearcher` for search; but the assertion of results is copied from `assertGraphSearch` .
   
   The test tests that there is no problem with accessing a graph from multiple threads at the same time.  We can move this test to `TestKnnVectorQuery` if you think that it will fit there better.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org