Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/09/21 21:03:02 UTC

[GitHub] [lucene] mayya-sharipova opened a new pull request #315: Disk write and read of hnsw graph

mayya-sharipova opened a new pull request #315:
URL: https://github.com/apache/lucene/pull/315


   Disk write and read of the hierarchical NSW (HNSW) graph.
   
   Modify the graph files as follows:
   
   - .vem - meta file
   - .vec - vector data file
   - .vex - vector graph index, index file for the vector graph file
   - .vgr - vector graph file, stores graph connections (previously
     called .vex file)
   
   .vem file
   +-------------+--------+------------------+------------------+-
   | FieldNumber | SimFun | VectorDataOffset | VectorDataLength |
   +-------------+--------+------------------+------------------+-
   
   -+------------------+------------------+-----------------+-----------------+-
    | graphIndexOffset | graphIndexLength | graphDataOffset | graphDataLength |
   -+------------------+------------------+-----------------+-----------------+-
   
   --+-------------+------------+------+--------+
     | NumOfLevels | Dimensions | Size | DocIds |
   --+-------------+------------+------+--------+
   
   - graph offsets are moved to the .vex file.
   This keeps the metadata file smaller, and makes it possible
   to later load graph offsets on first use, or to make
   them off-heap.
   
   .vec file stays the same
   
   .vex file:
   +-------------+--------------------+--------------------+---------------+--
   | NumOfLevels | NumOfNodesOnLevel0 | NumOfNodesOnLevel1 | NodesOnLevel1 | ...
   +-------------+--------------------+--------------------+---------------+--
   
   -+----------------------+-----------------+-
    | NumOfNodesOnLevelMax | NodesOnLevelMax |
   -+----------------------+-----------------+--
   
   --+------------------------------+----+--------------------------------+
     | GraphOffsetsForNodesOnLevel0 | ...| GraphOffsetsForNodesOnLevelMax |
   --+------------------------------+----+--------------------------------+
   
   .vgr file:
   +------------------------+------------------------+-----+--------------------------+
   | Level0NodesConnections | Level1NodesConnections | ....| LevelMaxNodesConnections |
   +------------------------+------------------------+-----+--------------------------+
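   
   The following is a sketch (not the PR's exact code) of how a reader can decode
   the per-field .vex section laid out above, modeled on the
   fillGraphNodesAndOffsetsByLevel code in this change. The class name is
   illustrative, and the encoding of the per-node graph offsets (delta-coded
   vlongs here) is an assumption the diagrams do not pin down; node counts and
   node ids follow the readInt()/readVInt() calls shown in the reader diff.
   
   ```java
   import java.io.IOException;
   import java.util.ArrayList;
   import java.util.List;
   import org.apache.lucene.store.DataInput;
   
   final class GraphIndexSketch {
     int numOfLevels;
     // entry 0 stays null: level 0 contains every node 0..size-1
     List<int[]> nodesByLevel = new ArrayList<>();
     List<long[]> ordOffsetsByLevel = new ArrayList<>();
   
     void read(DataInput in) throws IOException {
       numOfLevels = in.readInt();
       int[] numOfNodesByLevel = new int[numOfLevels];
   
       // NumOfNodesOnLevel0 .. NumOfNodesOnLevelMax, then node ids for levels > 0
       for (int level = 0; level < numOfLevels; level++) {
         numOfNodesByLevel[level] = in.readInt();
         if (level == 0) {
           nodesByLevel.add(null);
         } else {
           int[] nodesOnLevel = new int[numOfNodesByLevel[level]];
           for (int i = 0; i < nodesOnLevel.length; i++) {
             nodesOnLevel[i] = in.readVInt();
           }
           nodesByLevel.add(nodesOnLevel);
         }
       }
   
       // GraphOffsetsForNodesOnLevel0 .. GraphOffsetsForNodesOnLevelMax
       for (int level = 0; level < numOfLevels; level++) {
         long[] ordOffsets = new long[numOfNodesByLevel[level]];
         long offset = 0;
         for (int i = 0; i < ordOffsets.length; i++) {
           offset += in.readVLong(); // assumption: offsets stored as deltas
           ordOffsets[i] = offset;
         }
         ordOffsetsByLevel.add(ordOffsets);
       }
     }
   }
   ```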
   




[GitHub] [lucene] rmuir commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r721435977



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -301,55 +349,62 @@ public KnnGraphValues getGraphValues(String field) throws IOException {
       throw new IllegalArgumentException("No such field '" + field + "'");
     }
     FieldEntry entry = fields.get(field);
-    if (entry != null && entry.indexDataLength > 0) {
+    if (entry != null && entry.graphIndexLength > 0) {
       return getGraphValues(entry);
     } else {
       return KnnGraphValues.EMPTY;
     }
   }
 
   private KnnGraphValues getGraphValues(FieldEntry entry) throws IOException {
+    if (entry.ordOffsetsByLevel == null) {
+      synchronized (entry) {

Review comment:
       yes, it is just a bug that CheckIndex isn't inspecting the vectors.






[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r720879434



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -69,10 +99,12 @@
 
   static final String META_CODEC_NAME = "Lucene90HnswVectorsFormatMeta";
   static final String VECTOR_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatData";
-  static final String VECTOR_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatIndex";
+  static final String GRAPH_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatGraphIndex";
+  static final String GRAPH_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatGraphData";
   static final String META_EXTENSION = "vem";
   static final String VECTOR_DATA_EXTENSION = "vec";
-  static final String VECTOR_INDEX_EXTENSION = "vex";
+  static final String GRAPH_INDEX_EXTENSION = "vex";
+  static final String GRAPH_DATA_EXTENSION = "vgr";

Review comment:
       I hadn't noticed that, good observation; I've modified it to use `vgr` in c55118c2231fda21706ee78665598f9416ac9f6e






[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r720879574



##########
File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##########
@@ -52,12 +53,21 @@ protected KnnGraphValues() {}
    */
   public abstract int nextNeighbor() throws IOException;
 
-  /** Returns top level of the graph * */
-  public abstract int maxLevel() throws IOException;
+  /** Returns the number of levels of the graph */
+  public abstract int numOfLevels() throws IOException;

Review comment:
       Great suggestion, addressed for this class and other classes in c55118c2231fda21706ee78665598f9416ac9f6e






[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r717053652



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -77,13 +80,23 @@
               Lucene90HnswVectorsFormat.VECTOR_DATA_EXTENSION,
               Lucene90HnswVectorsFormat.VECTOR_DATA_CODEC_NAME,
               checksumRef);
-      vectorIndex =
+      graphIndex =
           openDataInput(
               state,
               versionMeta,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_CODEC_NAME,
               checksumRef);
+      graphData =
+          openDataInput(
+              state,
+              versionMeta,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_CODEC_NAME,
+              checksumRef);
+      // fill graph nodes and offsets by level.
+      // TODO: should we do this on the first field access?

Review comment:
       Addressed in 7e60c4d009652fcc6278bdb7f4a02982eb667900. Deferred loading of graph nodes and offsets to the first use.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -77,13 +80,23 @@
               Lucene90HnswVectorsFormat.VECTOR_DATA_EXTENSION,
               Lucene90HnswVectorsFormat.VECTOR_DATA_CODEC_NAME,
               checksumRef);
-      vectorIndex =
+      graphIndex =
           openDataInput(
               state,
               versionMeta,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_CODEC_NAME,
               checksumRef);
+      graphData =
+          openDataInput(
+              state,
+              versionMeta,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_CODEC_NAME,
+              checksumRef);
+      // fill graph nodes and offsets by level.
+      // TODO: should we do this on the first field access?

Review comment:
       Addressed in 7e60c4d009652fcc6278bdb7f4a02982eb667900. Deferred loading of graph nodes and offsets to the first use.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -204,6 +217,41 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =
+          graphIndex.slice("graph-index", entry.graphIndexOffset, entry.graphIndexLength);
+      int numOfLevels = input.readInt();
+      assert entry.numOfLevels == numOfLevels;
+      int[] numOfNodesByLevel = new int[numOfLevels];
+
+      // read nodes by level
+      for (int level = 0; level < numOfLevels; level++) {
+        numOfNodesByLevel[level] = input.readInt();
+        if (level == 0) {
+          entry.nodesByLevel.add(null);
+        } else {
+          final int[] nodesOnLevel = new int[numOfNodesByLevel[level]];
+          for (int i = 0; i < numOfNodesByLevel[level]; i++) {
+            nodesOnLevel[i] = input.readVInt();
+          }
+          entry.nodesByLevel.add(nodesOnLevel);
+        }
+      }
+
+      // read offsets by level
+      long offset = 0;
+      for (int level = 0; level < numOfLevels; level++) {
+        long[] ordOffsets = new long[numOfNodesByLevel[level]];

Review comment:
       There cannot be a level without nodes. The reason we do `nodesByLevel.add(null)` for level 0 is that this level contains all nodes, so we don't need this info.
   
   I've added extra assertions and comments in 7e60c4d009652fcc6278bdb7f4a02982eb667900. Hopefully this clarifies things.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -478,12 +531,25 @@ private void readValue(int targetOrd) throws IOException {
     IndexedKnnGraphReader(FieldEntry entry, IndexInput dataIn) {
       this.entry = entry;
       this.dataIn = dataIn;
+      this.entryNode =
+          entry.numOfLevels == 1 ? 0 : entry.nodesByLevel.get(entry.numOfLevels - 1)[0];
     }
 
     @Override
     public void seek(int level, int targetOrd) throws IOException {
+      long graphDataOffset;
+      if (level == 0) {
+        graphDataOffset = entry.ordOffsetsByLevel.get(0)[targetOrd];
+      } else {
+        int targetIndex =
+            Arrays.binarySearch(
+                entry.nodesByLevel.get(level), 0, entry.nodesByLevel.get(level).length, targetOrd);
+        assert targetIndex >= 0;

Review comment:
       Good catch, I've removed the unnecessary assertion `assert targetIndex >= 0;`, as in this case we will get an AIOOBE.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -505,13 +571,47 @@ public int nextNeighbor() throws IOException {
     }
 
     @Override
-    public int maxLevel() throws IOException {
-      return 0;
+    public int numOfLevels() throws IOException {
+      return entry.numOfLevels;
     }
 
     @Override
     public int entryNode() throws IOException {
-      return 0;
+      return entryNode;
+    }
+
+    @Override
+    public DocIdSetIterator getAllNodesOnLevel(int level) {
+      return new DocIdSetIterator() {
+        int[] nodes = level == 0 ? null : entry.nodesByLevel.get(level);
+        int numOfNodes = level == 0 ? size() : nodes.length;
+        int idx = -1;
+
+        @Override
+        public int docID() {
+          return level == 0 ? idx : nodes[idx];

Review comment:
       Addressed in 7e60c4d






[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r720914724



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -204,6 +214,45 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel(FieldEntry entry) throws IOException {
+    IndexInput input =
+        graphIndex.slice("graph-index", entry.graphIndexOffset, entry.graphIndexLength);
+    int numOfLevels = input.readInt();
+    assert entry.numOfLevels == numOfLevels;
+    int[] numOfNodesByLevel = new int[numOfLevels];
+    List<int[]> nodesByLevel = new ArrayList<>(entry.numOfLevels);

Review comment:
       Great comment! Addressed in 41befe7deb28730a7fa3c5747ba93f83b9257d2f






[GitHub] [lucene] mayya-sharipova commented on pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on pull request #315:
URL: https://github.com/apache/lucene/pull/315#issuecomment-928298036


   @msokolov Thanks a lot for your review. I've tried to address your comments in the second comment.




[GitHub] [lucene] msokolov commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r715650308



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -77,13 +80,23 @@
               Lucene90HnswVectorsFormat.VECTOR_DATA_EXTENSION,
               Lucene90HnswVectorsFormat.VECTOR_DATA_CODEC_NAME,
               checksumRef);
-      vectorIndex =
+      graphIndex =
           openDataInput(
               state,
               versionMeta,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_CODEC_NAME,
               checksumRef);
+      graphData =
+          openDataInput(
+              state,
+              versionMeta,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_CODEC_NAME,
+              checksumRef);
+      // fill graph nodes and offsets by level.
+      // TODO: should we do this on the first field access?

Review comment:
       Maybe we could skip loading to heap, leave on disk; then we wouldn't need to preload at all. This probably has a bunch of complexities and tradeoffs though, so I wouldn't mix it in with this change. For now it makes sense to me to load here, and yeah to defer the loading. Quite possibly you refresh many times and never access the data here.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -204,6 +217,41 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =
+          graphIndex.slice("graph-index", entry.graphIndexOffset, entry.graphIndexLength);
+      int numOfLevels = input.readInt();
+      assert entry.numOfLevels == numOfLevels;
+      int[] numOfNodesByLevel = new int[numOfLevels];
+
+      // read nodes by level
+      for (int level = 0; level < numOfLevels; level++) {
+        numOfNodesByLevel[level] = input.readInt();
+        if (level == 0) {
+          entry.nodesByLevel.add(null);
+        } else {
+          final int[] nodesOnLevel = new int[numOfNodesByLevel[level]];
+          for (int i = 0; i < numOfNodesByLevel[level]; i++) {
+            nodesOnLevel[i] = input.readVInt();
+          }
+          entry.nodesByLevel.add(nodesOnLevel);
+        }
+      }
+
+      // read offsets by level
+      long offset = 0;
+      for (int level = 0; level < numOfLevels; level++) {
+        long[] ordOffsets = new long[numOfNodesByLevel[level]];

Review comment:
       nit: should we use `null` here if there are no nodes on a level, as we do above? Actually I prefer the empty array, but let's be consistent in any case.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -478,12 +531,25 @@ private void readValue(int targetOrd) throws IOException {
     IndexedKnnGraphReader(FieldEntry entry, IndexInput dataIn) {
       this.entry = entry;
       this.dataIn = dataIn;
+      this.entryNode =
+          entry.numOfLevels == 1 ? 0 : entry.nodesByLevel.get(entry.numOfLevels - 1)[0];
     }
 
     @Override
     public void seek(int level, int targetOrd) throws IOException {
+      long graphDataOffset;
+      if (level == 0) {
+        graphDataOffset = entry.ordOffsetsByLevel.get(0)[targetOrd];
+      } else {
+        int targetIndex =
+            Arrays.binarySearch(
+                entry.nodesByLevel.get(level), 0, entry.nodesByLevel.get(level).length, targetOrd);
+        assert targetIndex >= 0;

Review comment:
       or I see we will raise an AIOOBE with a negative offset - I guess that's fine.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -478,12 +531,25 @@ private void readValue(int targetOrd) throws IOException {
     IndexedKnnGraphReader(FieldEntry entry, IndexInput dataIn) {
       this.entry = entry;
       this.dataIn = dataIn;
+      this.entryNode =
+          entry.numOfLevels == 1 ? 0 : entry.nodesByLevel.get(entry.numOfLevels - 1)[0];
     }
 
     @Override
     public void seek(int level, int targetOrd) throws IOException {
+      long graphDataOffset;
+      if (level == 0) {
+        graphDataOffset = entry.ordOffsetsByLevel.get(0)[targetOrd];
+      } else {
+        int targetIndex =
+            Arrays.binarySearch(
+                entry.nodesByLevel.get(level), 0, entry.nodesByLevel.get(level).length, targetOrd);
+        assert targetIndex >= 0;

Review comment:
       this is a public method; I think we should test for the `targetIndex < 0` case and raise an exception? Maybe IllegalStateException, not sure.
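   
   A small self-contained sketch of the guard being suggested; the helper name and
   the IllegalStateException are placeholders (the thread leaves the exception type
   open), and only the Arrays.binarySearch contract is relied on.
   
   ```java
   import java.util.Arrays;
   
   final class LevelOrdinals {
     /** Maps a graph ordinal to its index on a level, failing loudly instead of
      *  relying on an assert (or a later AIOOBE from a negative index). */
     static int indexOnLevel(int[] nodesOnLevel, int level, int targetOrd) {
       int targetIndex = Arrays.binarySearch(nodesOnLevel, 0, nodesOnLevel.length, targetOrd);
       if (targetIndex < 0) {
         // binarySearch returns (-(insertion point) - 1) when targetOrd is absent
         throw new IllegalStateException(
             "node " + targetOrd + " is not present on level " + level);
       }
       return targetIndex;
     }
   }
   ```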

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -505,13 +571,47 @@ public int nextNeighbor() throws IOException {
     }
 
     @Override
-    public int maxLevel() throws IOException {
-      return 0;
+    public int numOfLevels() throws IOException {
+      return entry.numOfLevels;
     }
 
     @Override
     public int entryNode() throws IOException {
-      return 0;
+      return entryNode;
+    }
+
+    @Override
+    public DocIdSetIterator getAllNodesOnLevel(int level) {
+      return new DocIdSetIterator() {
+        int[] nodes = level == 0 ? null : entry.nodesByLevel.get(level);
+        int numOfNodes = level == 0 ? size() : nodes.length;
+        int idx = -1;
+
+        @Override
+        public int docID() {
+          return level == 0 ? idx : nodes[idx];

Review comment:
       should we specialize and return a different iterator for level 0?
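   
   One possible shape for that specialization, as a sketch: level 0 contains every
   node 0..size-1, so DocIdSetIterator.all can cover it, while upper levels iterate
   the sorted node-id array. The factory method name is illustrative, not the PR's.
   
   ```java
   import java.io.IOException;
   import org.apache.lucene.search.DocIdSetIterator;
   
   final class LevelIterators {
     static DocIdSetIterator nodesOnLevel(int level, int size, int[] nodesOnLevel) {
       if (level == 0) {
         // dense case: every ordinal is present on level 0
         return DocIdSetIterator.all(size);
       }
       return new DocIdSetIterator() {
         int idx = -1;
   
         @Override
         public int docID() {
           if (idx < 0) {
             return -1;
           }
           return idx >= nodesOnLevel.length ? NO_MORE_DOCS : nodesOnLevel[idx];
         }
   
         @Override
         public int nextDoc() {
           idx++;
           return docID();
         }
   
         @Override
         public int advance(int target) throws IOException {
           return slowAdvance(target); // nodesOnLevel is sorted; fine for a sketch
         }
   
         @Override
         public long cost() {
           return nodesOnLevel.length;
         }
       };
     }
   }
   ```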






[GitHub] [lucene] msokolov commented on pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
msokolov commented on pull request #315:
URL: https://github.com/apache/lucene/pull/315#issuecomment-926762372


   I did run a quick benchmark on an internal data-set (using KnnGraphTester) and saw some improvement to both latency and recall, and a reduction in nodes visited from this change:
   
   condition | recall | latency | nDoc    | fanout | maxConn | beamWidth | visited | index ms
   ----------|--------|---------|---------|--------|---------|-----------|---------|---------
   main      | 0.815  | 0.91    | 1000000 | 50     | 32      | 64        | 2590    | 484529
   change    | 0.836  | 0.86    | 1000000 | 50     | 32      | 64        | 2331    | 462900
   
   and I think this test is actually unfair to HNSW because of the fanout which incurs some additional cost in the HNSW case for no benefit (it just increases top K by that amount and then throws away the extras).
   




[GitHub] [lucene] rmuir commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r721296208



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -301,55 +349,62 @@ public KnnGraphValues getGraphValues(String field) throws IOException {
       throw new IllegalArgumentException("No such field '" + field + "'");
     }
     FieldEntry entry = fields.get(field);
-    if (entry != null && entry.indexDataLength > 0) {
+    if (entry != null && entry.graphIndexLength > 0) {
       return getGraphValues(entry);
     } else {
       return KnnGraphValues.EMPTY;
     }
   }
 
   private KnnGraphValues getGraphValues(FieldEntry entry) throws IOException {
+    if (entry.ordOffsetsByLevel == null) {
+      synchronized (entry) {

Review comment:
       can we remove this? Looks like lazy-loading.
   
   The problem with this is that it doesn't buy anything to lazy-load. At first it might seem like it saves some memory if the feature isn't used, until merge happens, then BOOM.
   
   I think instead we should keep it simple and just load up front, and spend our time trying to reduce the amount of memory that this thing uses.






[GitHub] [lucene] jtibshirani commented on pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on pull request #315:
URL: https://github.com/apache/lucene/pull/315#issuecomment-929588721


   I'm not an expert in designing formats, but this looks good to me overall! I left some minor comments.
   
   > EG I saw Julie changed the default beam-width to 100, which is fine, makes sense for the public datasets, but I've observed the 32 we initially had as being a better choice for some of the vectors we work with.
   
   That's interesting to know, thanks for sharing. As a clarification, the default beamWidth used to be 16.




[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r720879434



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -69,10 +99,12 @@
 
   static final String META_CODEC_NAME = "Lucene90HnswVectorsFormatMeta";
   static final String VECTOR_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatData";
-  static final String VECTOR_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatIndex";
+  static final String GRAPH_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatGraphIndex";
+  static final String GRAPH_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatGraphData";
   static final String META_EXTENSION = "vem";
   static final String VECTOR_DATA_EXTENSION = "vec";
-  static final String VECTOR_INDEX_EXTENSION = "vex";
+  static final String GRAPH_INDEX_EXTENSION = "vex";
+  static final String GRAPH_DATA_EXTENSION = "vgr";

Review comment:
       I hadn't noticed that, good observation; I've modified it to use `veg` in c55118c2231fda21706ee78665598f9416ac9f6e






[GitHub] [lucene] jtibshirani commented on pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on pull request #315:
URL: https://github.com/apache/lucene/pull/315#issuecomment-924560110


   The benchmark results are surprising to me. To double-check, I pulled down your code and ran the benchmark on my machine. For each value of `num_cands`, the recall and QPS match the old numbers almost exactly. Even if there wasn't a big performance improvement, I wouldn't expect them to match up so closely!




[GitHub] [lucene] mayya-sharipova commented on pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on pull request #315:
URL: https://github.com/apache/lucene/pull/315#issuecomment-933034051


   @jtibshirani Thanks for another round of review. I've tried to address your comments in the last commits.  Please continue to review when you have time.




[GitHub] [lucene] mayya-sharipova commented on pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on pull request #315:
URL: https://github.com/apache/lucene/pull/315#issuecomment-924385291


   
   `sift-128-euclidean`
   Built index in 3391.124460220337
   
   
   ```bash
                           recall       QPS
   LuceneHnsw(n_cands=10)  0.606     6253.181
   LuceneHnsw(n_cands=50)  0.890     3058.817
   LuceneHnsw(n_cands=100) 0.952     1949.500
   LuceneHnsw(n_cands=500) 0.996      521.085
   LuceneHnsw(n_cands=800) 0.998      360.440
   ```
   
   ----
   `glove-100-angular`
   
   Built index in 10102.856158971786
   
   ```bash
                           recall       QPS
   LuceneHnsw(n_cands=10)   0.511     3771.049
   LuceneHnsw(n_cands=50)   0.761     1619.984
   LuceneHnsw(n_cands=100)  0.834      988.812
   LuceneHnsw(n_cands=500)  0.941      256.745
   LuceneHnsw(n_cands=800)  0.962      169.147
   ```
   
   Doesn't seem much different from the original results reported by @jtibshirani in [LUCENE-9937](https://issues.apache.org/jira/browse/LUCENE-9937)
   
   




[GitHub] [lucene] mayya-sharipova merged pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova merged pull request #315:
URL: https://github.com/apache/lucene/pull/315


   




[GitHub] [lucene] msokolov commented on pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
msokolov commented on pull request #315:
URL: https://github.com/apache/lucene/pull/315#issuecomment-926634656


   Hmm, I can give it a try with a different data set. I find these things are quite different with different datasets. E.g. I saw Julie changed the default beam-width to 100, which is fine and makes sense for the public datasets, but I've observed that the 32 we initially had is a better choice for some of the vectors we work with. So maybe the hierarchical graph's value varies like that?




[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r717057996



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -204,6 +217,41 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =
+          graphIndex.slice("graph-index", entry.graphIndexOffset, entry.graphIndexLength);
+      int numOfLevels = input.readInt();
+      assert entry.numOfLevels == numOfLevels;
+      int[] numOfNodesByLevel = new int[numOfLevels];
+
+      // read nodes by level
+      for (int level = 0; level < numOfLevels; level++) {
+        numOfNodesByLevel[level] = input.readInt();
+        if (level == 0) {
+          entry.nodesByLevel.add(null);
+        } else {
+          final int[] nodesOnLevel = new int[numOfNodesByLevel[level]];
+          for (int i = 0; i < numOfNodesByLevel[level]; i++) {
+            nodesOnLevel[i] = input.readVInt();
+          }
+          entry.nodesByLevel.add(nodesOnLevel);
+        }
+      }
+
+      // read offsets by level
+      long offset = 0;
+      for (int level = 0; level < numOfLevels; level++) {
+        long[] ordOffsets = new long[numOfNodesByLevel[level]];

Review comment:
       There cannot be a level without nodes. The reason we do `nodesByLevel.add(null)` for level 0 is that this level contains all nodes, so we don't need this info.
   
   I've added extra assertions and comments in 7e60c4d009652fcc6278bdb7f4a02982eb667900. Hopefully this clarifies things.






[GitHub] [lucene] rmuir commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r721419192



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -301,55 +349,62 @@ public KnnGraphValues getGraphValues(String field) throws IOException {
       throw new IllegalArgumentException("No such field '" + field + "'");
     }
     FieldEntry entry = fields.get(field);
-    if (entry != null && entry.indexDataLength > 0) {
+    if (entry != null && entry.graphIndexLength > 0) {
       return getGraphValues(entry);
     } else {
       return KnnGraphValues.EMPTY;
     }
   }
 
   private KnnGraphValues getGraphValues(FieldEntry entry) throws IOException {
+    if (entry.ordOffsetsByLevel == null) {
+      synchronized (entry) {

Review comment:
       I'm confused: why is it a problem to just load it once, up front, in the Lucene90HnswVectorsReader constructor? This should only be created once per IndexReader.






[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r720879315



##########
File path: lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java
##########
@@ -268,103 +281,47 @@ public void testDiversity() throws IOException {
     // First add nodes until everybody gets a full neighbor list
     HnswGraphBuilder builder =
         new HnswGraphBuilder(
-            vectors, VectorSimilarityFunction.DOT_PRODUCT, 2, 10, 0, random().nextInt());
+            vectors, VectorSimilarityFunction.DOT_PRODUCT, 2, 10, random().nextInt());
     // node 0 is added by the builder constructor
     // builder.addGraphNode(vectors.vectorValue(0));
     builder.addGraphNode(1, vectors.vectorValue(1));
     builder.addGraphNode(2, vectors.vectorValue(2));
     // now every node has tried to attach every other node as a neighbor, but
     // some were excluded based on diversity check.
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    assertNeighbors(builder.hnsw, 1, 0);
-    assertNeighbors(builder.hnsw, 2, 0);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
+    assertLevel0Neighbors(builder.hnsw, 1, 0);
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
 
     builder.addGraphNode(3, vectors.vectorValue(3));
-    assertNeighbors(builder.hnsw, 0, 1, 2);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
     // we added 3 here
-    assertNeighbors(builder.hnsw, 1, 0, 3);
-    assertNeighbors(builder.hnsw, 2, 0);
-    assertNeighbors(builder.hnsw, 3, 1);
+    assertLevel0Neighbors(builder.hnsw, 1, 0, 3);
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
+    assertLevel0Neighbors(builder.hnsw, 3, 1);
 
     // supplant an existing neighbor
     builder.addGraphNode(4, vectors.vectorValue(4));
     // 4 is the same distance from 0 that 2 is; we leave the existing node in place
-    assertNeighbors(builder.hnsw, 0, 1, 2);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
     // 4 is closer to 1 than either existing neighbor (0, 3). 3 fails diversity check with 4, so
     // replace it
-    assertNeighbors(builder.hnsw, 1, 0, 4);
-    assertNeighbors(builder.hnsw, 2, 0);
+    assertLevel0Neighbors(builder.hnsw, 1, 0, 4);
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
     // 1 survives the diversity check
-    assertNeighbors(builder.hnsw, 3, 1, 4);
-    assertNeighbors(builder.hnsw, 4, 1, 3);
+    assertLevel0Neighbors(builder.hnsw, 3, 1, 4);
+    assertLevel0Neighbors(builder.hnsw, 4, 1, 3);
 
     builder.addGraphNode(5, vectors.vectorValue(5));
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    assertNeighbors(builder.hnsw, 1, 0, 5);
-    assertNeighbors(builder.hnsw, 2, 0);
-    // even though 5 is closer, 3 is not a neighbor of 5, so no update to *its* neighbors occurs
-    assertNeighbors(builder.hnsw, 3, 1, 4);
-    assertNeighbors(builder.hnsw, 4, 3, 5);
-    assertNeighbors(builder.hnsw, 5, 1, 4);
-  }
-
-  public void testDiversityHNSW() throws IOException {
-    // Some carefully checked test cases with simple 2d vectors on the unit circle:
-    MockVectorValues vectors =
-        new MockVectorValues(
-            new float[][] {
-              unitVector2d(0.5),
-              unitVector2d(0.75),
-              unitVector2d(0.2),
-              unitVector2d(0.9),
-              unitVector2d(0.8),
-              unitVector2d(0.77),
-            });
-    // First add nodes until everybody gets a full neighbor list
-    int maxConn = 2;
-    double ml = 1 / Math.log(1.0 * maxConn);
-    HnswGraphBuilder builder =
-        new HnswGraphBuilder(
-            vectors, VectorSimilarityFunction.DOT_PRODUCT, maxConn, 10, ml, random().nextInt());
-    // node 0 is added by the builder constructor
-    builder.addGraphNodeHNSW(1, vectors.vectorValue(1));
-    builder.addGraphNodeHNSW(2, vectors.vectorValue(2));
-    // now every node has tried to attach every other node as a neighbor, but
-    // some were excluded based on diversity check.
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    assertNeighbors(builder.hnsw, 1, 0);
-    assertNeighbors(builder.hnsw, 2, 0);
-
-    builder.addGraphNodeHNSW(3, vectors.vectorValue(3));
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    // we added 3 here
-    assertNeighbors(builder.hnsw, 1, 0, 3);
-    assertNeighbors(builder.hnsw, 2, 0);
-    assertNeighbors(builder.hnsw, 3, 1);
-
-    // supplant an existing neighbor
-    builder.addGraphNodeHNSW(4, vectors.vectorValue(4));
-    // 4 is the same distance from 0 that 2 is; we leave the existing node in place
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    // 4 is closer to 1 than either existing neighbor (0, 3). 3 fails diversity check with 4, so
-    // replace it
-    assertNeighbors(builder.hnsw, 1, 0, 4);
-    assertNeighbors(builder.hnsw, 2, 0);
-    // 1 survives the diversity check
-    assertNeighbors(builder.hnsw, 3, 1, 4);
-    assertNeighbors(builder.hnsw, 4, 1, 3);
-
-    builder.addGraphNodeHNSW(5, vectors.vectorValue(5));
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    assertNeighbors(builder.hnsw, 1, 0, 5);
-    assertNeighbors(builder.hnsw, 2, 0);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
+    assertLevel0Neighbors(builder.hnsw, 1, 0, 5);
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
     // even though 5 is closer, 3 is not a neighbor of 5, so no update to *its* neighbors occurs
-    assertNeighbors(builder.hnsw, 3, 1, 4);
-    assertNeighbors(builder.hnsw, 4, 3, 5);
-    assertNeighbors(builder.hnsw, 5, 1, 4);
+    assertLevel0Neighbors(builder.hnsw, 3, 1, 4);
+    assertLevel0Neighbors(builder.hnsw, 4, 3, 5);
+    assertLevel0Neighbors(builder.hnsw, 5, 1, 4);
   }
 
-  private void assertNeighbors(HnswGraph graph, int node, int... expected) {
+  private void assertLevel0Neighbors(HnswGraph graph, int node, int... expected) {

Review comment:
       Good suggestion, addressed in c55118c2231fda21706ee78665598f9416ac9f6e






[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r721427046



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -301,55 +349,62 @@ public KnnGraphValues getGraphValues(String field) throws IOException {
       throw new IllegalArgumentException("No such field '" + field + "'");
     }
     FieldEntry entry = fields.get(field);
-    if (entry != null && entry.indexDataLength > 0) {
+    if (entry != null && entry.graphIndexLength > 0) {
       return getGraphValues(entry);
     } else {
       return KnnGraphValues.EMPTY;
     }
   }
 
   private KnnGraphValues getGraphValues(FieldEntry entry) throws IOException {
+    if (entry.ordOffsetsByLevel == null) {
+      synchronized (entry) {

Review comment:
       I guess I assumed that we may invoke the Lucene90HnswVectorsReader constructor even when we don't need any graph data or graph search, but just want to check an index as in `CheckIndex:checkIndex`. But maybe this is a wrong assumption outside tests, and we always access segment readers when we need to search them.






[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r717058396



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -478,12 +531,25 @@ private void readValue(int targetOrd) throws IOException {
     IndexedKnnGraphReader(FieldEntry entry, IndexInput dataIn) {
       this.entry = entry;
       this.dataIn = dataIn;
+      this.entryNode =
+          entry.numOfLevels == 1 ? 0 : entry.nodesByLevel.get(entry.numOfLevels - 1)[0];
     }
 
     @Override
     public void seek(int level, int targetOrd) throws IOException {
+      long graphDataOffset;
+      if (level == 0) {
+        graphDataOffset = entry.ordOffsetsByLevel.get(0)[targetOrd];
+      } else {
+        int targetIndex =
+            Arrays.binarySearch(
+                entry.nodesByLevel.get(level), 0, entry.nodesByLevel.get(level).length, targetOrd);
+        assert targetIndex >= 0;

Review comment:
       Good catch, I've removed the unnecessary assertion `assert targetIndex >= 0;`, as in this case we will get an AIOOBE.






[GitHub] [lucene] jtibshirani commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r717773659



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -69,10 +99,12 @@
 
   static final String META_CODEC_NAME = "Lucene90HnswVectorsFormatMeta";
   static final String VECTOR_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatData";
-  static final String VECTOR_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatIndex";
+  static final String GRAPH_INDEX_CODEC_NAME = "Lucene90HnswVectorsFormatGraphIndex";
+  static final String GRAPH_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatGraphData";
   static final String META_EXTENSION = "vem";
   static final String VECTOR_DATA_EXTENSION = "vec";
-  static final String VECTOR_INDEX_EXTENSION = "vex";
+  static final String GRAPH_INDEX_EXTENSION = "vex";
+  static final String GRAPH_DATA_EXTENSION = "vgr";

Review comment:
       Not a big deal but looking at other formats, it's very common for extensions to start with the same characters. So maybe this could be `veg` instead of `vgr`.

##########
File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##########
@@ -52,12 +53,21 @@ protected KnnGraphValues() {}
    */
   public abstract int nextNeighbor() throws IOException;
 
-  /** Returns top level of the graph * */
-  public abstract int maxLevel() throws IOException;
+  /** Returns the number of levels of the graph */
+  public abstract int numOfLevels() throws IOException;

Review comment:
       Super small comment, saying `numLevels` seems more typical?

##########
File path: lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java
##########
@@ -268,103 +281,47 @@ public void testDiversity() throws IOException {
     // First add nodes until everybody gets a full neighbor list
     HnswGraphBuilder builder =
         new HnswGraphBuilder(
-            vectors, VectorSimilarityFunction.DOT_PRODUCT, 2, 10, 0, random().nextInt());
+            vectors, VectorSimilarityFunction.DOT_PRODUCT, 2, 10, random().nextInt());
     // node 0 is added by the builder constructor
     // builder.addGraphNode(vectors.vectorValue(0));
     builder.addGraphNode(1, vectors.vectorValue(1));
     builder.addGraphNode(2, vectors.vectorValue(2));
     // now every node has tried to attach every other node as a neighbor, but
     // some were excluded based on diversity check.
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    assertNeighbors(builder.hnsw, 1, 0);
-    assertNeighbors(builder.hnsw, 2, 0);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
+    assertLevel0Neighbors(builder.hnsw, 1, 0);
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
 
     builder.addGraphNode(3, vectors.vectorValue(3));
-    assertNeighbors(builder.hnsw, 0, 1, 2);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
     // we added 3 here
-    assertNeighbors(builder.hnsw, 1, 0, 3);
-    assertNeighbors(builder.hnsw, 2, 0);
-    assertNeighbors(builder.hnsw, 3, 1);
+    assertLevel0Neighbors(builder.hnsw, 1, 0, 3);
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
+    assertLevel0Neighbors(builder.hnsw, 3, 1);
 
     // supplant an existing neighbor
     builder.addGraphNode(4, vectors.vectorValue(4));
     // 4 is the same distance from 0 that 2 is; we leave the existing node in place
-    assertNeighbors(builder.hnsw, 0, 1, 2);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
     // 4 is closer to 1 than either existing neighbor (0, 3). 3 fails diversity check with 4, so
     // replace it
-    assertNeighbors(builder.hnsw, 1, 0, 4);
-    assertNeighbors(builder.hnsw, 2, 0);
+    assertLevel0Neighbors(builder.hnsw, 1, 0, 4);
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
     // 1 survives the diversity check
-    assertNeighbors(builder.hnsw, 3, 1, 4);
-    assertNeighbors(builder.hnsw, 4, 1, 3);
+    assertLevel0Neighbors(builder.hnsw, 3, 1, 4);
+    assertLevel0Neighbors(builder.hnsw, 4, 1, 3);
 
     builder.addGraphNode(5, vectors.vectorValue(5));
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    assertNeighbors(builder.hnsw, 1, 0, 5);
-    assertNeighbors(builder.hnsw, 2, 0);
-    // even though 5 is closer, 3 is not a neighbor of 5, so no update to *its* neighbors occurs
-    assertNeighbors(builder.hnsw, 3, 1, 4);
-    assertNeighbors(builder.hnsw, 4, 3, 5);
-    assertNeighbors(builder.hnsw, 5, 1, 4);
-  }
-
-  public void testDiversityHNSW() throws IOException {
-    // Some carefully checked test cases with simple 2d vectors on the unit circle:
-    MockVectorValues vectors =
-        new MockVectorValues(
-            new float[][] {
-              unitVector2d(0.5),
-              unitVector2d(0.75),
-              unitVector2d(0.2),
-              unitVector2d(0.9),
-              unitVector2d(0.8),
-              unitVector2d(0.77),
-            });
-    // First add nodes until everybody gets a full neighbor list
-    int maxConn = 2;
-    double ml = 1 / Math.log(1.0 * maxConn);
-    HnswGraphBuilder builder =
-        new HnswGraphBuilder(
-            vectors, VectorSimilarityFunction.DOT_PRODUCT, maxConn, 10, ml, random().nextInt());
-    // node 0 is added by the builder constructor
-    builder.addGraphNodeHNSW(1, vectors.vectorValue(1));
-    builder.addGraphNodeHNSW(2, vectors.vectorValue(2));
-    // now every node has tried to attach every other node as a neighbor, but
-    // some were excluded based on diversity check.
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    assertNeighbors(builder.hnsw, 1, 0);
-    assertNeighbors(builder.hnsw, 2, 0);
-
-    builder.addGraphNodeHNSW(3, vectors.vectorValue(3));
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    // we added 3 here
-    assertNeighbors(builder.hnsw, 1, 0, 3);
-    assertNeighbors(builder.hnsw, 2, 0);
-    assertNeighbors(builder.hnsw, 3, 1);
-
-    // supplant an existing neighbor
-    builder.addGraphNodeHNSW(4, vectors.vectorValue(4));
-    // 4 is the same distance from 0 that 2 is; we leave the existing node in place
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    // 4 is closer to 1 than either existing neighbor (0, 3). 3 fails diversity check with 4, so
-    // replace it
-    assertNeighbors(builder.hnsw, 1, 0, 4);
-    assertNeighbors(builder.hnsw, 2, 0);
-    // 1 survives the diversity check
-    assertNeighbors(builder.hnsw, 3, 1, 4);
-    assertNeighbors(builder.hnsw, 4, 1, 3);
-
-    builder.addGraphNodeHNSW(5, vectors.vectorValue(5));
-    assertNeighbors(builder.hnsw, 0, 1, 2);
-    assertNeighbors(builder.hnsw, 1, 0, 5);
-    assertNeighbors(builder.hnsw, 2, 0);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
+    assertLevel0Neighbors(builder.hnsw, 1, 0, 5);
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
     // even though 5 is closer, 3 is not a neighbor of 5, so no update to *its* neighbors occurs
-    assertNeighbors(builder.hnsw, 3, 1, 4);
-    assertNeighbors(builder.hnsw, 4, 3, 5);
-    assertNeighbors(builder.hnsw, 5, 1, 4);
+    assertLevel0Neighbors(builder.hnsw, 3, 1, 4);
+    assertLevel0Neighbors(builder.hnsw, 4, 3, 5);
+    assertLevel0Neighbors(builder.hnsw, 5, 1, 4);
   }
 
-  private void assertNeighbors(HnswGraph graph, int node, int... expected) {
+  private void assertLevel0Neighbors(HnswGraph graph, int node, int... expected) {

Review comment:
       Small comment, maybe we could fold `assertLevel0Neighbors` and `assertLevelNeighbors` together.
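      One possible shape for the folded helper (a sketch only; `neighborsOnLevel` is a stand-in for whatever accessor exposes a node's neighbor ords on a given level, and it assumes JUnit's assertArrayEquals plus java.util.Arrays):

          private void assertLevelNeighbors(HnswGraph graph, int level, int node, int... expected) {
            int[] actual = neighborsOnLevel(graph, level, node);  // hypothetical accessor
            Arrays.sort(actual);
            Arrays.sort(expected);
            assertArrayEquals("neighbors of node " + node + " at level " + level, expected, actual);
          }

          private void assertLevel0Neighbors(HnswGraph graph, int node, int... expected) {
            assertLevelNeighbors(graph, 0, node, expected);  // level-0 convenience overload
          }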

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -204,6 +214,45 @@ private FieldEntry readField(DataInput input) throws IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel(FieldEntry entry) throws IOException {
+    IndexInput input =
+        graphIndex.slice("graph-index", entry.graphIndexOffset, entry.graphIndexLength);
+    int numOfLevels = input.readInt();
+    assert entry.numOfLevels == numOfLevels;
+    int[] numOfNodesByLevel = new int[numOfLevels];
+    List<int[]> nodesByLevel = new ArrayList<>(entry.numOfLevels);

Review comment:
      Could we use an array instead of `List` to keep things consistent/simple?
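      For example, roughly (a sketch only; the read calls are assumptions about how the per-level counts and node ords are encoded):

          int numOfLevels = input.readInt();
          int[][] nodesByLevel = new int[numOfLevels][];  // instead of List<int[]>
          for (int level = 1; level < numOfLevels; level++) {
            // level 0 contains every node, so no explicit list is kept for it
            int numOfNodes = input.readInt();
            nodesByLevel[level] = new int[numOfNodes];
            for (int i = 0; i < numOfNodes; i++) {
              nodesByLevel[level][i] = input.readVInt();
            }
          }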

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -77,13 +80,23 @@
               Lucene90HnswVectorsFormat.VECTOR_DATA_EXTENSION,
               Lucene90HnswVectorsFormat.VECTOR_DATA_CODEC_NAME,
               checksumRef);
-      vectorIndex =
+      graphIndex =
           openDataInput(
               state,
               versionMeta,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_CODEC_NAME,
               checksumRef);
+      graphData =
+          openDataInput(
+              state,
+              versionMeta,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_CODEC_NAME,
+              checksumRef);
+      // fill graph nodes and offsets by level.
+      // TODO: should we do this on the first field access?

Review comment:
       +1 to look into leaving this on disk in a follow-up. In addition to minimizing heap usage, it'd be nice to avoid synchronization logic.
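      A rough sketch of what keeping it on disk could look like: resolve a node's graph offset from a slice of the graph index per lookup instead of materializing per-level arrays on heap (the fixed-width layout and the `offsetTableStart` helper are assumptions, not the actual format, and the slice would need a clone per consumer):

          IndexInput offsets =
              graphIndex.slice("graph-index", entry.graphIndexOffset, entry.graphIndexLength);

          long graphOffsetForNode(int level, int ord) throws IOException {
            offsets.seek(offsetTableStart(level) + (long) ord * Long.BYTES);  // jump to the table entry
            return offsets.readLong();                                        // offset into the graph data
          }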






[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r717053652



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -77,13 +80,23 @@
               Lucene90HnswVectorsFormat.VECTOR_DATA_EXTENSION,
               Lucene90HnswVectorsFormat.VECTOR_DATA_CODEC_NAME,
               checksumRef);
-      vectorIndex =
+      graphIndex =
           openDataInput(
               state,
               versionMeta,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_CODEC_NAME,
               checksumRef);
+      graphData =
+          openDataInput(
+              state,
+              versionMeta,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_CODEC_NAME,
+              checksumRef);
+      // fill graph nodes and offsets by level.
+      // TODO: should we do this on the first field access?

Review comment:
      Addressed in 7e60c4d009652fcc6278bdb7f4a02982eb667900. Deferred loading of graph nodes and offsets to the first use.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -77,13 +80,23 @@
               Lucene90HnswVectorsFormat.VECTOR_DATA_EXTENSION,
               Lucene90HnswVectorsFormat.VECTOR_DATA_CODEC_NAME,
               checksumRef);
-      vectorIndex =
+      graphIndex =
           openDataInput(
               state,
               versionMeta,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_EXTENSION,
-              Lucene90HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_INDEX_CODEC_NAME,
               checksumRef);
+      graphData =
+          openDataInput(
+              state,
+              versionMeta,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_EXTENSION,
+              Lucene90HnswVectorsFormat.GRAPH_DATA_CODEC_NAME,
+              checksumRef);
+      // fill graph nodes and offsets by level.
+      // TODO: should we do this on the first field access?

Review comment:
      Addressed in 7e60c4d009652fcc6278bdb7f4a02982eb667900. Deferred loading of graph nodes and offsets to the first use.






[GitHub] [lucene] mayya-sharipova commented on pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on pull request #315:
URL: https://github.com/apache/lucene/pull/315#issuecomment-928298036


   @msokolov Thanks a lot for your review. I've tried to address your comments in the second commit.




[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r721410089



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -301,55 +349,62 @@ public KnnGraphValues getGraphValues(String field) throws IOException {
       throw new IllegalArgumentException("No such field '" + field + "'");
     }
     FieldEntry entry = fields.get(field);
-    if (entry != null && entry.indexDataLength > 0) {
+    if (entry != null && entry.graphIndexLength > 0) {
       return getGraphValues(entry);
     } else {
       return KnnGraphValues.EMPTY;
     }
   }
 
   private KnnGraphValues getGraphValues(FieldEntry entry) throws IOException {
+    if (entry.ordOffsetsByLevel == null) {
+      synchronized (entry) {

Review comment:
      @rmuir Thanks for checking. Good to know that we don't encourage lazy loading; I will revert this change from this PR and avoid lazy loading in the future.

    The problem with the current code is that we load all the graph data into memory in the Reader constructor, but we may instantiate this reader many times, and some of those times we don't need the graph data (e.g. if we just run CheckIndex). We plan to address this by using off-heap graph data instead.
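    For reference, the lazy path being reverted is the usual double-checked initialization (sketch only; `getGraphValuesFromEntry` is a stand-in for however the view is built, and `ordOffsetsByLevel` would need to be volatile for this pattern to be safe):

        private KnnGraphValues getGraphValues(FieldEntry entry) throws IOException {
          if (entry.ordOffsetsByLevel == null) {         // first check without the lock
            synchronized (entry) {
              if (entry.ordOffsetsByLevel == null) {     // re-check under the lock
                fillGraphNodesAndOffsetsByLevel(entry);  // load nodes/offsets once per field
              }
            }
          }
          return getGraphValuesFromEntry(entry);         // hypothetical: build the view as before
        }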






[GitHub] [lucene] msokolov commented on pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
msokolov commented on pull request #315:
URL: https://github.com/apache/lucene/pull/315#issuecomment-926653417


   Also, is it possible to report avg(num found)? IIRC we are reporting the number of graph nodes visited as the TopDocs total hits value, and this can give some deeper insight into the differences.
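   Something like this over the collected per-query results would do it (sketch; `results` is assumed to be the list of per-query TopDocs the benchmark already keeps):

       long visited = 0;
       for (TopDocs td : results) {
         visited += td.totalHits.value;  // per the above, total hits == graph nodes visited
       }
       System.out.printf(Locale.ROOT, "avg(num found) = %.2f%n", (double) visited / results.size());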




[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r721410089



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -301,55 +349,62 @@ public KnnGraphValues getGraphValues(String field) throws IOException {
       throw new IllegalArgumentException("No such field '" + field + "'");
     }
     FieldEntry entry = fields.get(field);
-    if (entry != null && entry.indexDataLength > 0) {
+    if (entry != null && entry.graphIndexLength > 0) {
       return getGraphValues(entry);
     } else {
       return KnnGraphValues.EMPTY;
     }
   }
 
   private KnnGraphValues getGraphValues(FieldEntry entry) throws IOException {
+    if (entry.ordOffsetsByLevel == null) {
+      synchronized (entry) {

Review comment:
      @rmuir Thanks for checking. Good to know that we don't encourage lazy loading; I will revert this change from this PR and avoid lazy loading in the future.

    The problem with the current code is that we load all the graph data into memory in the Reader constructor, but we may instantiate this reader many times, and some of those times we don't need the graph data (e.g. if we just run CheckIndex). We plan to address this by using off-heap graph data instead, so reverting this lazy loading now makes sense.






[GitHub] [lucene] mayya-sharipova commented on pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on pull request #315:
URL: https://github.com/apache/lucene/pull/315#issuecomment-924778773


   @jtibshirani Thanks for double-checking; that's indeed surprising. I will do more checks on my side as well.




[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r721410089



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -301,55 +349,62 @@ public KnnGraphValues getGraphValues(String field) throws IOException {
       throw new IllegalArgumentException("No such field '" + field + "'");
     }
     FieldEntry entry = fields.get(field);
-    if (entry != null && entry.indexDataLength > 0) {
+    if (entry != null && entry.graphIndexLength > 0) {
       return getGraphValues(entry);
     } else {
       return KnnGraphValues.EMPTY;
     }
   }
 
   private KnnGraphValues getGraphValues(FieldEntry entry) throws IOException {
+    if (entry.ordOffsetsByLevel == null) {
+      synchronized (entry) {

Review comment:
      @rmuir Thanks for checking. Good to know that we don't encourage lazy loading; I will revert this change from this PR and avoid lazy loading in the future.

    The problem with the current code is that we load all the graph data into memory in the Reader constructor, but we may instantiate this reader many times, and some of those times we don't need the graph data (e.g. if we just run CheckIndex). We plan to address this by using off-heap graph data instead.






[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r721464714



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -301,55 +349,62 @@ public KnnGraphValues getGraphValues(String field) throws IOException {
       throw new IllegalArgumentException("No such field '" + field + "'");
     }
     FieldEntry entry = fields.get(field);
-    if (entry != null && entry.indexDataLength > 0) {
+    if (entry != null && entry.graphIndexLength > 0) {
       return getGraphValues(entry);
     } else {
       return KnnGraphValues.EMPTY;
     }
   }
 
   private KnnGraphValues getGraphValues(FieldEntry entry) throws IOException {
+    if (entry.ordOffsetsByLevel == null) {
+      synchronized (entry) {

Review comment:
       Addressed in b9c4df99f4c4e6c3cd02d336fedf75972778fe53






[GitHub] [lucene] mayya-sharipova commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r717058689



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -505,13 +571,47 @@ public int nextNeighbor() throws IOException {
     }
 
     @Override
-    public int maxLevel() throws IOException {
-      return 0;
+    public int numOfLevels() throws IOException {
+      return entry.numOfLevels;
     }
 
     @Override
     public int entryNode() throws IOException {
-      return 0;
+      return entryNode;
+    }
+
+    @Override
+    public DocIdSetIterator getAllNodesOnLevel(int level) {
+      return new DocIdSetIterator() {
+        int[] nodes = level == 0 ? null : entry.nodesByLevel.get(level);
+        int numOfNodes = level == 0 ? size() : nodes.length;
+        int idx = -1;
+
+        @Override
+        public int docID() {
+          return level == 0 ? idx : nodes[idx];

Review comment:
       Addressed in 7e60c4d






[GitHub] [lucene] rmuir commented on a change in pull request #315: Disk write and read of hnsw graph

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #315:
URL: https://github.com/apache/lucene/pull/315#discussion_r721421242



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -301,55 +349,62 @@ public KnnGraphValues getGraphValues(String field) throws IOException {
       throw new IllegalArgumentException("No such field '" + field + "'");
     }
     FieldEntry entry = fields.get(field);
-    if (entry != null && entry.indexDataLength > 0) {
+    if (entry != null && entry.graphIndexLength > 0) {
       return getGraphValues(entry);
     } else {
       return KnnGraphValues.EMPTY;
     }
   }
 
   private KnnGraphValues getGraphValues(FieldEntry entry) throws IOException {
+    if (entry.ordOffsetsByLevel == null) {
+      synchronized (entry) {

Review comment:
       > and for some of these times we don't need the graph data (e.g. if we just do CheckIndex).
   
    Wait, that's a different bug. We should open an issue for it. CheckIndex should be "checking" the vectors somehow (even if it's just uselessly reading the bytes).
   
   As it documents: "As this tool checks every byte in the index, on a large index it can take quite a long time to run."



