You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/09/24 16:35:39 UTC

[GitHub] [lucene] msokolov commented on a change in pull request #315: Disk write and read of hnsw graph

msokolov commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r715650308



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -77,13 +80,23 @@
               Lucene90HnswVectorsFormat.VECTOR_DATA_EXTENSION,
               Lucene90HnswVectorsFormat.VECTOR_DATA_CODEC_NAME,
               checksumRef);
-      vectorIndex =
+      graphIndex =
           openDataInput(
               state,
               versionMeta,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_CODEC_NAME,
               checksumRef);
+      graphData =
+          openDataInput(
+              state,
+              versionMeta,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_CODEC_NAME,
+              checksumRef);
+      // fill graph nodes and offsets by level.
+      // TODO: should we do this on the first field access?

Review comment:
       Maybe we could skip loading to heap, leave on disk; then we wouldn't need to preload at all. This probably has a bunch of complexities and tradeoffs though, so I wouldn't mix it in with this change. For now it makes sense to me to load here, and yeah to defer the loading. Quite possibly you refresh many times and never access the data here.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -204,6 +217,41 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =
+          graphIndex.slice("graph-index", entry.graphIndexOffset, entry.graphIndexLength);
+      int numOfLevels = input.readInt();
+      assert entry.numOfLevels == numOfLevels;
+      int[] numOfNodesByLevel = new int[numOfLevels];
+
+      // read nodes by level
+      for (int level = 0; level < numOfLevels; level++) {
+        numOfNodesByLevel[level] = input.readInt();
+        if (level == 0) {
+          entry.nodesByLevel.add(null);
+        } else {
+          final int[] nodesOnLevel = new int[numOfNodesByLevel[level]];
+          for (int i = 0; i < numOfNodesByLevel[level]; i++) {
+            nodesOnLevel[i] = input.readVInt();
+          }
+          entry.nodesByLevel.add(nodesOnLevel);
+        }
+      }
+
+      // read offsets by level
+      long offset = 0;
+      for (int level = 0; level < numOfLevels; level++) {
+        long[] ordOffsets = new long[numOfNodesByLevel[level]];

Review comment:
       nit: should we use `null` here if there are no nodes on a level, as we do above? Actually I prefer the empty array, but let's be consistent in any case.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -478,12 +531,25 @@ private void readValue(int targetOrd) throws IOException {
     IndexedKnnGraphReader(FieldEntry entry, IndexInput dataIn) {
       this.entry = entry;
       this.dataIn = dataIn;
+      this.entryNode =
+          entry.numOfLevels == 1 ? 0 : entry.nodesByLevel.get(entry.numOfLevels - 1)[0];
     }
 
     @Override
     public void seek(int level, int targetOrd) throws IOException {
+      long graphDataOffset;
+      if (level == 0) {
+        graphDataOffset = entry.ordOffsetsByLevel.get(0)[targetOrd];
+      } else {
+        int targetIndex =
+            Arrays.binarySearch(
+                entry.nodesByLevel.get(level), 0, entry.nodesByLevel.get(level).length, targetOrd);
+        assert targetIndex >= 0;

Review comment:
       or I see we will raise an AIOOBE with a negative offset - I guess that's fine.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -478,12 +531,25 @@ private void readValue(int targetOrd) throws IOException {
     IndexedKnnGraphReader(FieldEntry entry, IndexInput dataIn) {
       this.entry = entry;
       this.dataIn = dataIn;
+      this.entryNode =
+          entry.numOfLevels == 1 ? 0 : entry.nodesByLevel.get(entry.numOfLevels - 1)[0];
     }
 
     @Override
     public void seek(int level, int targetOrd) throws IOException {
+      long graphDataOffset;
+      if (level == 0) {
+        graphDataOffset = entry.ordOffsetsByLevel.get(0)[targetOrd];
+      } else {
+        int targetIndex =
+            Arrays.binarySearch(
+                entry.nodesByLevel.get(level), 0, entry.nodesByLevel.get(level).length, targetOrd);
+        assert targetIndex >= 0;

Review comment:
       this is a public method; I think we should test for the `targetIndex < 0` case and raise an exception? Maybe IllegalStateException, not sure.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -505,13 +571,47 @@ public int nextNeighbor() throws IOException {
     }
 
     @Override
-    public int maxLevel() throws IOException {
-      return 0;
+    public int numOfLevels() throws IOException {
+      return entry.numOfLevels;
     }
 
     @Override
     public int entryNode() throws IOException {
-      return 0;
+      return entryNode;
+    }
+
+    @Override
+    public DocIdSetIterator getAllNodesOnLevel(int level) {
+      return new DocIdSetIterator() {
+        int[] nodes = level == 0 ? null : entry.nodesByLevel.get(level);
+        int numOfNodes = level == 0 ? size() : nodes.length;
+        int idx = -1;
+
+        @Override
+        public int docID() {
+          return level == 0 ? idx : nodes[idx];

Review comment:
       should we specialize and return a different iterator for level 0?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org