You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/05/06 18:54:50 UTC

[GitHub] [lucene] mayya-sharipova opened a new pull request, #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

mayya-sharipova opened a new pull request, #872:
URL: https://github.com/apache/lucene/pull/872

   The original HNSW paper (https://arxiv.org/pdf/1603.09320.pdf) suggests
   to use a different maxConn for the upper layers vs. the bottom one
   (which contains the full neighborhood graph). Specifically, they
   suggest using maxConn=M for upper layers and maxConn=2*M for the bottom.
   
   This patch ensures that we follow this recommendation and use
   maxConn=2*M for the bottom layer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #872:
URL: https://github.com/apache/lucene/pull/872#issuecomment-1119925540

   This patch involves format change, but for now I  made changes in Lucene91* files.
   The plan is once another https://github.com/apache/lucene/pull/870 that introduces Lucene92* format is merged to the main branch, I will reorganize this PR to use Lucene92* format.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r870464398


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -43,7 +43,8 @@ public final class HnswGraphBuilder {
   /** Random seed for level generation; public to expose for testing * */
   public static long randSeed = DEFAULT_RAND_SEED;
 
-  private final int maxConn;
+  private final int M; // max number of connections on upper layers
+  private final int maxConn0; // max number of connections on the 0th (last) layer

Review Comment:
   Addressed in d16168f77cbeae581a093daa53113bf897ab8e31



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r869226440


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -43,7 +43,8 @@ public final class HnswGraphBuilder {
   /** Random seed for level generation; public to expose for testing * */
   public static long randSeed = DEFAULT_RAND_SEED;
 
-  private final int maxConn;
+  private final int M; // max number of connections on upper layers

Review Comment:
   @mocobeta Thanks for checking, indeed it is good to rename `maxConn` to `M` parameter in the paper, and `maxConn` is ambiguous (as we have two values for `maxConn` now: for upper layers and a lower layer)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on PR #872:
URL: https://github.com/apache/lucene/pull/872#issuecomment-1121379649

   Thanks, this looks the same as what I was seeing now! It's good motivation to add Lucene to ann-benchmarks so we can stop using a custom local benchmark set-up!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mocobeta commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mocobeta commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r868929217


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -43,7 +43,8 @@ public final class HnswGraphBuilder {
   /** Random seed for level generation; public to expose for testing * */
   public static long randSeed = DEFAULT_RAND_SEED;
 
-  private final int maxConn;
+  private final int M; // max number of connections on upper layers

Review Comment:
   Sorry - I missed the previous Julie's comment. If it makes sense to you, I'm fine with it. Maybe we should respect the parameter name in the original paper, like BM25's k and b.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r870464781


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java:
##########
@@ -256,10 +256,11 @@ public void testSearchWithSelectiveAcceptOrds() throws IOException {
 
   public void testSearchWithSkewedAcceptOrds() throws IOException {
     int nDoc = 1000;
+    int maxConn = 16;

Review Comment:
   Good suggestion, I've decided to pull out this parameter. Addressed in d16168f77cbeae581a093daa53113bf897ab8e31



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #872:
URL: https://github.com/apache/lucene/pull/872#issuecomment-1125028616

   @jtibshirani @mocobeta  Thanks for your review. I've changed the current PR based on the recent vector format changes from `main` branch – mostly copy, just in `OnHeapHnswGraph`, I've decided instead of `M` field to have dedicated `nsize` and `nsize0` - neighbours size, so not to calculate them for every node, which also made the code tidier. I hope this is fine.
   
   I plan to merge this PR and backport it to branch_9x today. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #872:
URL: https://github.com/apache/lucene/pull/872#issuecomment-1121144832

   @jtibshirani  Thanks for the comment. I've rerun the benchmarks as you suggested, and here are the new results
   
   ```txt
   
   k            Approach                                                  Recall     QPS
   10           luceneknn dim=100 {'M': 32, 'efConstruction': 100}        0.571     1874.073
   50           luceneknn dim=100 {'M': 32, 'efConstruction': 100}        0.801      752.443
   100          luceneknn dim=100 {'M': 32, 'efConstruction': 100}        0.865      463.214
   500          luceneknn dim=100 {'M': 32, 'efConstruction': 100}        0.959      129.944
   800          luceneknn dim=100 {'M': 32, 'efConstruction': 100}        0.974       87.815
   1000         luceneknn dim=100 {'M': 32, 'efConstruction': 100}        0.980       73.514
   
   10           hnswlib ({'M': 32, 'efConstruction': 100})                0.552    16745.433
   50           hnswlib ({'M': 32, 'efConstruction': 100})                0.794     5738.468
   100          hnswlib ({'M': 32, 'efConstruction': 100})                0.860     3336.386
   500          hnswlib ({'M': 32, 'efConstruction': 100})                0.956      832.982
   800          hnswlib ({'M': 32, 'efConstruction': 100})                0.973      541.097
   1000         hnswlib ({'M': 32, 'efConstruction': 100})                0.979      442.163
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mocobeta commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mocobeta commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r868923104


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -43,7 +43,8 @@ public final class HnswGraphBuilder {
   /** Random seed for level generation; public to expose for testing * */
   public static long randSeed = DEFAULT_RAND_SEED;
 
-  private final int maxConn;
+  private final int M; // max number of connections on upper layers

Review Comment:
   Does it make sense to have a bit more descriptive name for it? Or should we just call `M`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r868594873


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java:
##########
@@ -53,12 +53,14 @@ public final class Lucene91HnswVectorsWriter extends KnnVectorsWriter {
   private final int maxDoc;
 
   private final int maxConn;
+  private final int maxConn0;
   private final int beamWidth;
   private boolean finished;
 
-  Lucene91HnswVectorsWriter(SegmentWriteState state, int maxConn, int beamWidth)
+  Lucene91HnswVectorsWriter(SegmentWriteState state, int maxConn, int maxConn0, int beamWidth)

Review Comment:
   @jtibshirani Thanks for the comment, this is a great suggestion! Addressed in b1a6394402d8d535b092de73b227fb5a40015c65



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on PR #872:
URL: https://github.com/apache/lucene/pull/872#issuecomment-1120061467

   @mayya-sharipova what is the ann-benchmarks set-up that you're using? I found that in order to get comparable results to hnswlib, I had to make sure to adjust `efSearch` to these parameters:
   ```
         run-groups:
           M-4:
             arg-groups:
               - {"M": 32,  "efConstruction": 100}
             query-args: [[1000, 0, 40, 70, 90, 490, 790, 990]]
   ```
   
   First, there is a warm-up run that we throw away. Then we subtract 10 from the other `efSearch` since `KnnGraphTester` adds `k=10` to this value:
   
   ```
     private static TopDocs doKnnVectorQuery(
         IndexSearcher searcher, String field, float[] vector, int k, int fanout) throws IOException {
       return searcher.search(new KnnVectorQuery(field, vector, k + fanout), k);
     }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mocobeta commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mocobeta commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r870023604


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:
##########
@@ -377,10 +377,13 @@ private static class FieldEntry {
       for (int level = 0; level < numLevels; level++) {
         if (level == 0) {
           graphOffsetsByLevel[level] = 0;
+        } else if (level == 1) {
+          int numNodesOn0Level = size;

Review Comment:
   minor: `numNodesOnLevel0` might be clearer at the first glance (and consistent with other parts)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #872:
URL: https://github.com/apache/lucene/pull/872#issuecomment-1119928464

   I've run an evaluation:
   baseline: main branch **glove-100-angular M:64 efConstruction:100**
   candidate: this PR **glove-100-angular M:32 efConstruction:100**, **M:64 is used for 0th layer**
   
   baseline: Indexed 1183514 documents in 1087s
   candidate: Indexed 1183514 documents in 1097s
   
   
   |             | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ----------- | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10  |           0.686 |     1357.005 |            0.685 |      1356.824 |
   | n_cands=20  |           0.742 |     1078.005 |            0.741 |      1073.286 |
   | n_cands=40  |           0.802 |      783.909 |            0.801 |       779.959 |
   | n_cands=80  |           0.857 |      519.065 |            0.856 |       516.899 |
   | n_cands=120 |           0.885 |      394.007 |            0.884 |       394.255 |
   | n_cands=200 |           0.916 |      270.300 |            0.916 |       267.125 |
   | n_cands=400 |           0.951 |      156.148 |            0.951 |       154.515 |
   | n_cands=600 |           0.966 |      112.399 |            0.966 |       112.545 |
   | n_cands=800 |           0.974 |       88.135 |            0.975 |        87.845 |


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r870466493


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##########
@@ -31,6 +31,7 @@
 public final class OnHeapHnswGraph extends HnswGraph {
 
   private final int maxConn;
+  private final int maxConn0;

Review Comment:
   Addressed in d16168f77cbeae581a093daa53113bf897ab8e31



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r870464973


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java:
##########
@@ -377,10 +377,13 @@ private static class FieldEntry {
       for (int level = 0; level < numLevels; level++) {
         if (level == 0) {
           graphOffsetsByLevel[level] = 0;
+        } else if (level == 1) {
+          int numNodesOn0Level = size;

Review Comment:
   Good suggestion! Addressed in d16168f77cbeae581a093daa53113bf897ab8e31



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r869686535


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -43,7 +43,8 @@ public final class HnswGraphBuilder {
   /** Random seed for level generation; public to expose for testing * */
   public static long randSeed = DEFAULT_RAND_SEED;
 
-  private final int maxConn;
+  private final int M; // max number of connections on upper layers
+  private final int maxConn0; // max number of connections on the 0th (last) layer

Review Comment:
   Small comment, I think it'd be clearer to just store the parameter `M` and compute `maxConn0` on the fly. It's such a simple calculation it doesn't really help to store it.



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -68,42 +69,43 @@ public final class HnswGraphBuilder {
    *
    * @param vectors the vectors whose relations are represented by the graph - must provide a
    *     different view over those vectors than the one used to add via addGraphNode.
-   * @param maxConn the number of connections to make when adding a new graph node; roughly speaking
-   *     the graph fanout.
+   * @param M the number of connections to make when adding a new graph node; roughly speaking the

Review Comment:
   I guess we should update this to mention that the last layer is treated differently (it uses max 2*M connections?)



##########
lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java:
##########
@@ -256,10 +256,11 @@ public void testSearchWithSelectiveAcceptOrds() throws IOException {
 
   public void testSearchWithSkewedAcceptOrds() throws IOException {
     int nDoc = 1000;
+    int maxConn = 16;

Review Comment:
   I guess this should be called `M`? Or maybe we don't even need to pull out this parameter?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mocobeta commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mocobeta commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r870035060


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##########
@@ -31,6 +31,7 @@
 public final class OnHeapHnswGraph extends HnswGraph {
 
   private final int maxConn;
+  private final int maxConn0;

Review Comment:
   I guess we could have just `M` here too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mocobeta commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mocobeta commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r870030770


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -43,7 +43,8 @@ public final class HnswGraphBuilder {
   /** Random seed for level generation; public to expose for testing * */
   public static long randSeed = DEFAULT_RAND_SEED;
 
-  private final int maxConn;
+  private final int M; // max number of connections on upper layers
+  private final int maxConn0; // max number of connections on the 0th (last) layer

Review Comment:
   I'd agree with this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mocobeta commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mocobeta commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r868929217


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -43,7 +43,8 @@ public final class HnswGraphBuilder {
   /** Random seed for level generation; public to expose for testing * */
   public static long randSeed = DEFAULT_RAND_SEED;
 
-  private final int maxConn;
+  private final int M; // max number of connections on upper layers

Review Comment:
   Sorry - I missed the previous Julie's comment. If it makes sense to you, I'm fine with it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r868255093


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java:
##########
@@ -53,12 +53,14 @@ public final class Lucene91HnswVectorsWriter extends KnnVectorsWriter {
   private final int maxDoc;
 
   private final int maxConn;
+  private final int maxConn0;
   private final int beamWidth;
   private boolean finished;
 
-  Lucene91HnswVectorsWriter(SegmentWriteState state, int maxConn, int beamWidth)
+  Lucene91HnswVectorsWriter(SegmentWriteState state, int maxConn, int maxConn0, int beamWidth)

Review Comment:
   I was thinking we could just keep a single configuration parameter here, and internally calculate `maxConn0 = 2 * M`. If we allow it to be passed as a parameter, it seems like it's important to be able to configure it, but that's not the case (it is not something users will change and should always be set to `2 * M`). Like that we could also avoid writing a new value `maxConn0` into the format, which doesn't seem necessary?
   
   If we are worried about naming, we could rename `maxConn` to `M`. From my perspective, it's okay to use single-letter variable names (with a clear comment!) when it directly corresponds to a paper's algorithm.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r870463506


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -68,42 +69,43 @@ public final class HnswGraphBuilder {
    *
    * @param vectors the vectors whose relations are represented by the graph - must provide a
    *     different view over those vectors than the one used to add via addGraphNode.
-   * @param maxConn the number of connections to make when adding a new graph node; roughly speaking
-   *     the graph fanout.
+   * @param M the number of connections to make when adding a new graph node; roughly speaking the

Review Comment:
   Addressed in d16168f77cbeae581a093daa53113bf897ab8e31



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mocobeta commented on a diff in pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mocobeta commented on code in PR #872:
URL: https://github.com/apache/lucene/pull/872#discussion_r868923104


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -43,7 +43,8 @@ public final class HnswGraphBuilder {
   /** Random seed for level generation; public to expose for testing * */
   public static long randSeed = DEFAULT_RAND_SEED;
 
-  private final int maxConn;
+  private final int M; // max number of connections on upper layers

Review Comment:
   Does it make sense to have a bit more descriptive name for it? Or should we just call this `M`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova merged pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova merged PR #872:
URL: https://github.com/apache/lucene/pull/872


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #872:
URL: https://github.com/apache/lucene/pull/872#issuecomment-1123937553

   @jtibshirani  @mocobeta Thanks for your review. I've addressed your latest comments in d16168f77cbeae581a093daa53113bf897ab8e31.
   
   The plan for merging this PR is following: once [other changes](https://github.com/apache/lucene/pull/880) to vectors' format are merged to `main`, I will change this PR based on `main`, that all these changes will go to `Lucene92` vector format files instead of the current `Lucene91` vector format files.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #872: LUCENE-10527 Use 2*maxConn for last layer in HNSW

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #872:
URL: https://github.com/apache/lucene/pull/872#issuecomment-1119937491

   Also here is the comparison of recall with hnswlib (results from hnwlib are copied from the [Jira issue](https://issues.apache.org/jira/browse/LUCENE-10527), and not run on my machine)
   
   ```txt
   
   k            Approach                                                  Recall     QPS
   10           luceneknn dim=100 {'M': 32, 'efConstruction': 100}        0.685     1372.022
   50           luceneknn dim=100 {'M': 32, 'efConstruction': 100}        0.820      680.472
   100          luceneknn dim=100 {'M': 32, 'efConstruction': 100}        0.872      438.112
   500          luceneknn dim=100 {'M': 32, 'efConstruction': 100}        0.960      126.510
   800          luceneknn dim=100 {'M': 32, 'efConstruction': 100}        0.975       87.441
   1000         luceneknn dim=100 {'M': 32, 'efConstruction': 100}        0.980       73.193
   
   10           hnswlib ({'M': 32, 'efConstruction': 100})                0.552    16745.433
   50           hnswlib ({'M': 32, 'efConstruction': 100})                0.794     5738.468
   100          hnswlib ({'M': 32, 'efConstruction': 100})                0.860     3336.386
   500          hnswlib ({'M': 32, 'efConstruction': 100})                0.956      832.982
   800          hnswlib ({'M': 32, 'efConstruction': 100})                0.973      541.097
   1000         hnswlib ({'M': 32, 'efConstruction': 100})                0.979      442.163
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org