You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/09/28 20:29:12 UTC

[GitHub] [lucene] jtibshirani commented on a change in pull request #315: Disk write and read of hnsw graph

jtibshirani commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r717773659



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -69,10 +99,12 @@
 
   static final String META_CODEC_NAME = "Lucene90HnswVectorsFormatMeta";
   static final String VECTOR_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatData";
-  static final String VECTOR_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatIndex";
+  static final String GRAPH_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatGraphIndex";
+  static final String GRAPH_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatGraphData";
   static final String META_EXTENSION = "vem";
   static final String VECTOR_DATA_EXTENSION = "vec";
-  static final String VECTOR_INDEX_EXTENSION = "vex";
+  static final String GRAPH_INDEX_EXTENSION = "vex";
+  static final String GRAPH_DATA_EXTENSION = "vgr";

Review comment:
       Not a big deal but looking at other formats, it's very common for extensions to start with the same characters. So maybe this could be `veg` instead of `vgr`.

##########
File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##########
@@ -52,12 +53,21 @@ protected KnnGraphValues() {}
    */
   public abstract int nextNeighbor() throws IOException;
 
-  /** Returns top level of the graph * */
-  public abstract int maxLevel() throws IOException;
+  /** Returns the number of levels of the graph */
+  public abstract int numOfLevels() throws IOException;

Review comment:
       Super small comment, saying `numLevels` seems more typical?

##########
File path: lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java
##########
@@ -268,103 +281,47 @@ public void testDiversity() throws IOException {
     // First add nodes until everybody gets a full neighbor list
     HnswGraphBuilder builder =
         new HnswGraphBuilder(
-            vectors, VectorSimilarityFunction.DOT_PRODUCT, 2, 10, 0, random().nextInt());
+            vectors, VectorSimilarityFunction.DOT_PRODUCT, 2, 10, random().nextInt());
     // node 0 is added by the builder constructor
     // builder.addGraphNode(vectors.vectorValue(0));
     builder.addGraphNode(1, vectors.vectorValue(1));
     builder.addGraphNode(2, vectors.vectorValue(2));
     // now every node has tried to attach every other node as a neighbor, but
     // some were excluded based on diversity check.
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    assertNeighbors(builder.hnsw, 1, 0);
-    assertNeighbors(builder.hnsw, 2, 0);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
+    assertLevel0Neighbors(builder.hnsw, 1, 0);
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
 
     builder.addGraphNode(3, vectors.vectorValue(3));
-    assertNeighbors(builder.hnsw, 0, 1, 2);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
     // we added 3 here
-    assertNeighbors(builder.hnsw, 1, 0, 3);
-    assertNeighbors(builder.hnsw, 2, 0);
-    assertNeighbors(builder.hnsw, 3, 1);
+    assertLevel0Neighbors(builder.hnsw, 1, 0, 3);
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
+    assertLevel0Neighbors(builder.hnsw, 3, 1);
 
     // supplant an existing neighbor
     builder.addGraphNode(4, vectors.vectorValue(4));
     // 4 is the same distance from 0 that 2 is; we leave the existing node in place
-    assertNeighbors(builder.hnsw, 0, 1, 2);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
     // 4 is closer to 1 than either existing neighbor (0, 3). 3 fails diversity check with 4, so
     // replace it
-    assertNeighbors(builder.hnsw, 1, 0, 4);
-    assertNeighbors(builder.hnsw, 2, 0);
+    assertLevel0Neighbors(builder.hnsw, 1, 0, 4);
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
     // 1 survives the diversity check
-    assertNeighbors(builder.hnsw, 3, 1, 4);
-    assertNeighbors(builder.hnsw, 4, 1, 3);
+    assertLevel0Neighbors(builder.hnsw, 3, 1, 4);
+    assertLevel0Neighbors(builder.hnsw, 4, 1, 3);
 
     builder.addGraphNode(5, vectors.vectorValue(5));
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    assertNeighbors(builder.hnsw, 1, 0, 5);
-    assertNeighbors(builder.hnsw, 2, 0);
-    // even though 5 is closer, 3 is not a neighbor of 5, so no update to *its* neighbors occurs
-    assertNeighbors(builder.hnsw, 3, 1, 4);
-    assertNeighbors(builder.hnsw, 4, 3, 5);
-    assertNeighbors(builder.hnsw, 5, 1, 4);
-  }
-
-  public void testDiversityHNSW() throws IOException {
-    // Some carefully checked test cases with simple 2d vectors on the unit circle:
-    MockVectorValues vectors =
-        new MockVectorValues(
-            new float[][] {
-              unitVector2d(0.5),
-              unitVector2d(0.75),
-              unitVector2d(0.2),
-              unitVector2d(0.9),
-              unitVector2d(0.8),
-              unitVector2d(0.77),
-            });
-    // First add nodes until everybody gets a full neighbor list
-    int maxConn = 2;
-    double ml = 1 / Math.log(1.0 * maxConn);
-    HnswGraphBuilder builder =
-        new HnswGraphBuilder(
-            vectors, VectorSimilarityFunction.DOT_PRODUCT, maxConn, 10, ml, random().nextInt());
-    // node 0 is added by the builder constructor
-    builder.addGraphNodeHNSW(1, vectors.vectorValue(1));
-    builder.addGraphNodeHNSW(2, vectors.vectorValue(2));
-    // now every node has tried to attach every other node as a neighbor, but
-    // some were excluded based on diversity check.
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    assertNeighbors(builder.hnsw, 1, 0);
-    assertNeighbors(builder.hnsw, 2, 0);
-
-    builder.addGraphNodeHNSW(3, vectors.vectorValue(3));
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    // we added 3 here
-    assertNeighbors(builder.hnsw, 1, 0, 3);
-    assertNeighbors(builder.hnsw, 2, 0);
-    assertNeighbors(builder.hnsw, 3, 1);
-
-    // supplant an existing neighbor
-    builder.addGraphNodeHNSW(4, vectors.vectorValue(4));
-    // 4 is the same distance from 0 that 2 is; we leave the existing node in place
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    // 4 is closer to 1 than either existing neighbor (0, 3). 3 fails diversity check with 4, so
-    // replace it
-    assertNeighbors(builder.hnsw, 1, 0, 4);
-    assertNeighbors(builder.hnsw, 2, 0);
-    // 1 survives the diversity check
-    assertNeighbors(builder.hnsw, 3, 1, 4);
-    assertNeighbors(builder.hnsw, 4, 1, 3);
-
-    builder.addGraphNodeHNSW(5, vectors.vectorValue(5));
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    assertNeighbors(builder.hnsw, 1, 0, 5);
-    assertNeighbors(builder.hnsw, 2, 0);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
+    assertLevel0Neighbors(builder.hnsw, 1, 0, 5);
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
     // even though 5 is closer, 3 is not a neighbor of 5, so no update to *its* neighbors occurs
-    assertNeighbors(builder.hnsw, 3, 1, 4);
-    assertNeighbors(builder.hnsw, 4, 3, 5);
-    assertNeighbors(builder.hnsw, 5, 1, 4);
+    assertLevel0Neighbors(builder.hnsw, 3, 1, 4);
+    assertLevel0Neighbors(builder.hnsw, 4, 3, 5);
+    assertLevel0Neighbors(builder.hnsw, 5, 1, 4);
   }
 
-  private void assertNeighbors(HnswGraph graph, int node, int... expected) {
+  private void assertLevel0Neighbors(HnswGraph graph, int node, int... expected) {

Review comment:
       Small comment, maybe we could fold `assertLevel0Neighbors` and `assertLevelNeighbors` together.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -204,6 +214,45 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel(FieldEntry entry) throws IOException {
+    IndexInput input =
+        graphIndex.slice("graph-index", entry.graphIndexOffset, entry.graphIndexLength);
+    int numOfLevels = input.readInt();
+    assert entry.numOfLevels == numOfLevels;
+    int[] numOfNodesByLevel = new int[numOfLevels];
+    List<int[]> nodesByLevel = new ArrayList<>(entry.numOfLevels);

Review comment:
       Could we use an array instead of `List` to keep things consistent/ simple?

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -77,13 +80,23 @@
               Lucene90HnswVectorsFormat.VECTOR_DATA_EXTENSION,
               Lucene90HnswVectorsFormat.VECTOR_DATA_CODEC_NAME,
               checksumRef);
-      vectorIndex =
+      graphIndex =
           openDataInput(
               state,
               versionMeta,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_CODEC_NAME,
               checksumRef);
+      graphData =
+          openDataInput(
+              state,
+              versionMeta,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_CODEC_NAME,
+              checksumRef);
+      // fill graph nodes and offsets by level.
+      // TODO: should we do this on the first field access?

Review comment:
       +1 to look into leaving this on disk in a follow-up. In addition to minimizing heap usage, it'd be nice to avoid synchronization logic.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org