Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/02 13:59:26 UTC

[GitHub] [lucene] mayya-sharipova opened a new pull request, #11743: LUCENE-10592 Better estimate memory for HNSW graph

mayya-sharipova opened a new pull request, #11743:
URL: https://github.com/apache/lucene/pull/11743

   Better estimate the memory used for OnHeapHnswGraph,
   and add tests.
   
   Also don't over-allocate arrays in NeighborArray.
   
   Relates to #992


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #11743:
URL: https://github.com/apache/lucene/pull/11743#issuecomment-1236990320

   Whoops, I just reread @msokolov 's comment and it was not actually about buffering on disk, but rather about moving the vectors to disk before starting to build the graph. Sorry for the confusion; I thought it was about writing them to disk at index time. But now I wonder: could we buffer vectors on disk while keeping the approach of building the graph during indexing?




[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r962071131


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -104,8 +104,8 @@ public void removeLast() {
   }
 
   public void removeIndex(int idx) {
-    System.arraycopy(node, idx + 1, node, idx, size - idx);
-    System.arraycopy(score, idx + 1, score, idx, size - idx);
+    System.arraycopy(node, idx + 1, node, idx, size - idx - 1);

Review Comment:
   Thank you for bringing my attention to this. I checked, and indeed `length=0` is permissible in the `System.arraycopy` method.
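   For reference, here is a standalone sketch (hypothetical values, not the Lucene class) of the `removeIndex` pattern in the edge case where the last element is removed, i.e. `size - idx - 1 == 0`:

   ```java
   import java.util.Arrays;

   public class ArrayCopyZeroLength {
       public static void main(String[] args) {
           int[] node = {10, 20, 30};
           int size = 3;
           int idx = 2; // removing the last element: copy length is size - idx - 1 == 0
           // Per the javadoc, only a negative length (or out-of-bounds positions)
           // throws; a zero-length copy is a legal no-op.
           System.arraycopy(node, idx + 1, node, idx, size - idx - 1);
           size--;
           System.out.println(Arrays.toString(Arrays.copyOf(node, size))); // [10, 20]
       }
   }
   ```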





[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r962072588


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##########
@@ -175,20 +175,28 @@ public long ramBytesUsed() {
     long neighborArrayBytes0 =
         nsize0 * (Integer.BYTES + Float.BYTES)
             + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2
-            + RamUsageEstimator.NUM_BYTES_OBJECT_REF;
+            + RamUsageEstimator.NUM_BYTES_OBJECT_REF
+            + Integer.BYTES * 2;
     long neighborArrayBytes =
         nsize * (Integer.BYTES + Float.BYTES)
             + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2
-            + RamUsageEstimator.NUM_BYTES_OBJECT_REF;
-
+            + RamUsageEstimator.NUM_BYTES_OBJECT_REF
+            + Integer.BYTES * 2;
     long total = 0;
     for (int l = 0; l < numLevels; l++) {
       int numNodesOnLevel = graph.get(l).size();

Review Comment:
   In comparison with what `RamUsageTester::ramUsed` reports, we are actually underestimating. I still need to investigate further.
   Also, we preemptively allocate memory for all `NeighborArray` objects, which may never get populated with real data. We probably need to look into this.





[GitHub] [lucene] mayya-sharipova commented on pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #11743:
URL: https://github.com/apache/lucene/pull/11743#issuecomment-1238235494

   @msokolov 
   > Although ... the RAM needed for the graph was always required, even when building the graph during flush, it just wasn't accounted for I think. I suppose a possible way to improve the buffering situation would be to buffer the vectors in RAM and then on merge, write them out, freeing the on-heap copy, and while building the graph, access the vectors from disk
   
   Indeed, that is how we do it in Lucene 9.3, [using off-heap vector values to build the graph](https://github.com/apache/lucene/blob/branch_9_3/lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsWriter.java#L148-L154).
   
   The problem with this is that building the graph on flush would take a long time, which makes searches that need the updated changes unpredictably slow.




[GitHub] [lucene] msokolov commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
msokolov commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r961747487


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -104,8 +104,8 @@ public void removeLast() {
   }
 
   public void removeIndex(int idx) {
-    System.arraycopy(node, idx + 1, node, idx, size - idx);
-    System.arraycopy(score, idx + 1, score, idx, size - idx);
+    System.arraycopy(node, idx + 1, node, idx, size - idx - 1);

Review Comment:
   oh, good! Is it OK when `size - idx - 1 == 0`? I think so; at least I checked the javadocs, and they say only `length < 0` raises an error.



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##########
@@ -175,20 +175,28 @@ public long ramBytesUsed() {
     long neighborArrayBytes0 =
         nsize0 * (Integer.BYTES + Float.BYTES)
             + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2
-            + RamUsageEstimator.NUM_BYTES_OBJECT_REF;
+            + RamUsageEstimator.NUM_BYTES_OBJECT_REF
+            + Integer.BYTES * 2;
     long neighborArrayBytes =
         nsize * (Integer.BYTES + Float.BYTES)
             + RamUsageEstimator.NUM_BYTES_ARRAY_HEADER * 2
-            + RamUsageEstimator.NUM_BYTES_OBJECT_REF;
-
+            + RamUsageEstimator.NUM_BYTES_OBJECT_REF
+            + Integer.BYTES * 2;
     long total = 0;
     for (int l = 0; l < numLevels; l++) {
       int numNodesOnLevel = graph.get(l).size();

Review Comment:
   were we overestimating before because not all the nodes were populated?



##########
lucene/core/src/test/org/apache/lucene/util/TestRamUsageEstimator.java:
##########
@@ -222,6 +229,33 @@ public void testPrintValues() {
     System.out.println("LONG_SIZE = " + LONG_SIZE);
   }
 
+  public void testHnswGraph() throws IOException {
+    int size = atLeast(2000);
+    int dim = randomIntBetween(100, 1024);
+    int M = randomIntBetween(4, 96);
+    VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length - 1) + 1];
+    VectorEncoding vectorEncoding;
+    if (similarityFunction == VectorSimilarityFunction.DOT_PRODUCT) {

Review Comment:
   I don't think you need this check -- it's OK now to use BYTE encoding with other similarities





[GitHub] [lucene] mayya-sharipova commented on pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #11743:
URL: https://github.com/apache/lucene/pull/11743#issuecomment-1236414077

   @msokolov Thanks for your feedback.
   
   I have done a similar analysis comparing the 9.3 branch with this change on `glove-100-angular`, 1 million documents, M:16, efConstruction:100.
   
   **Results with 9.3:**
   
   IndexingChain::ramBytesUsed() reports 497089200 bytes or **497 MB**
   
   **Results with current change**
   
   IndexingChain::ramBytesUsed() reports memory vectors: 497075904; memory graph: 379073392; so total: **876 MB**
   
   
   So, you are right: much more memory is used during indexing. We need at least 16 * M * number_of_nodes bytes:
   - 2 * M neighbours for each node on the lowest level, times 8 bytes (4 bytes for the neighbour's node id + 4 bytes for the neighbour's score)
   - so indeed, if the indexing memory buffer is set lower than that, we would end up with many more segments, which is not desirable.
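   As a back-of-the-envelope check of the numbers above (constants taken from the benchmark in this comment; this is only the lower bound for the bottom layer, not the full `ramBytesUsed` accounting):

   ```java
   public class HnswMemoryLowerBound {
       public static void main(String[] args) {
           int m = 16;                 // maxConn (M) used in the benchmark
           long numNodes = 1_000_000;  // glove-100-angular, 1M documents
           // 2*M neighbours per node on the lowest level, 8 bytes each
           // (4-byte neighbour node id + 4-byte float score) => 16 * M * numNodes.
           long lowerBoundBytes = 2L * m * 8 * numNodes;
           System.out.println(lowerBoundBytes + " bytes"); // 256000000 bytes (~256 MB)
           // This sits below the reported 379073392 bytes for the graph, which
           // also covers upper levels and per-object overhead.
       }
   }
   ```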
   
   
   ----
   > I wonder if we should consider rolling back the "build graph during indexing" change? It seems to make indexing take > 10% longer and of course requires more RAM, which will tend to make more and smaller segments; not a desirable outcome.
   
   Thanks for the suggestion. I will discuss this with our team on Tuesday and get back to you.
   
   One thing I wonder about: we did not observe a longer total indexing time (combined indexing + refresh time). Did the combined indexing + refresh time become larger for you?
   
   
   




[GitHub] [lucene] msokolov commented on pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
msokolov commented on PR #11743:
URL: https://github.com/apache/lucene/pull/11743#issuecomment-1236325605

   I ran a test using KnnGraphTester, setting the IndexWriterConfig max buffer size to 250MB. I indexed 1M 256-dim docs (1K each, so 1GB in total). Results with 9.3:
   
   ```
   4 segments
   recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
   0.978    2.36   1000000 50      32      64      3977    476926
   ```
   
   With 9.x, 7 segments got created:
   
   ```
   recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
   0.983    4.05   1000000 50      32      64      60      522040  1.00    post-filter
   ```
   




[GitHub] [lucene] msokolov commented on pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
msokolov commented on PR #11743:
URL: https://github.com/apache/lucene/pull/11743#issuecomment-1236326177

   I wonder if we should consider rolling back the "build graph during indexing" change? It seems to make indexing take > 10% longer and of course requires more RAM, which will tend to make more and smaller segments; not a desirable outcome.




[GitHub] [lucene] mayya-sharipova commented on pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #11743:
URL: https://github.com/apache/lucene/pull/11743#issuecomment-1238777470

   @msokolov Thanks for your feedback and good ideas; we can experiment with them.
   
   We've discussed this within our team (including @jpountz and @jtibshirani) and decided that we would still like to proceed with this change (building the graph during indexing) for the 9.4 release, as we see the benefits outweighing the extra memory used. For example, we see a significant improvement in refresh time in our Elasticsearch benchmarks, which makes searches on updated data much faster and more predictable:
   
   <img width="822" alt="image" src="https://user-images.githubusercontent.com/5738841/188762407-0ecaaa9b-71df-4290-83db-630bfd01e5bd.png">
   
   ----
   
   For follow-ups (beyond the 9.4 release), we can experiment with:
   - a pre-flush flush (flushing a portion of vectors)
   - using lower-precision data than integers and floats for storing neighbours' info in the graph
   - other ideas?
   
   
   




[GitHub] [lucene] mayya-sharipova merged pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova merged PR #11743:
URL: https://github.com/apache/lucene/pull/11743




[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r966265508


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -46,7 +46,7 @@ public NeighborArray(int maxSize, boolean descOrder) {
    * nodes.
    */
   public void add(int newNode, float newScore) {
-    if (size == node.length - 1) {
+    if (size == node.length) {
       node = ArrayUtil.grow(node, (size + 1) * 3 / 2);

Review Comment:
   Great comment! Addressed in 45ced1e5beac8eb9932933967f3a2ca00d0c7835





[GitHub] [lucene] jpountz commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
jpountz commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r962569779


##########
lucene/core/src/test/org/apache/lucene/util/TestRamUsageEstimator.java:
##########
@@ -222,6 +229,28 @@ public void testPrintValues() {
     System.out.println("LONG_SIZE = " + LONG_SIZE);
   }
 
+  public void testHnswGraph() throws IOException {
+    int size = atLeast(2000);
+    int dim = randomIntBetween(100, 1024);
+    int M = randomIntBetween(4, 96);
+    VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length - 1) + 1];
+    VectorEncoding vectorEncoding =
+        VectorEncoding.values()[random().nextInt(VectorEncoding.values().length - 1) + 1];
+    TestHnswGraph.RandomVectorValues vectors =
+        new TestHnswGraph.RandomVectorValues(size, dim, vectorEncoding, random());
+
+    HnswGraphBuilder<?> builder =
+        HnswGraphBuilder.create(
+            vectors, vectorEncoding, similarityFunction, M, M * 2, random().nextLong());
+    OnHeapHnswGraph hnsw = builder.build(vectors.copy());
+    long estimated = RamUsageEstimator.sizeOfObject(hnsw);
+    long actual = ramUsed(hnsw);
+
+    assertEquals((double) actual, (double) estimated, (double) actual * 0.3);
+  }
+

Review Comment:
   Nit: Since this test is about testing `OnHeapHnswGraph#ramBytesUsed` rather than `RamUsageEstimator`, it should be in `TestHnswGraph`.



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##########
@@ -46,7 +46,7 @@ public NeighborArray(int maxSize, boolean descOrder) {
    * nodes.
    */
   public void add(int newNode, float newScore) {
-    if (size == node.length - 1) {
+    if (size == node.length) {
       node = ArrayUtil.grow(node, (size + 1) * 3 / 2);

Review Comment:
   I wonder if this line is intentional. `ArrayUtil#grow` already increases the provided size by 12.5%. And it looks like there is an intention here to increase the size by 50%. So overall this increases the size of the array by 68.75%. Should we either do `ArrayUtil.growExact(node, (size + 1) * 3 / 2)` to grow by 50% or possibly rely on `ArrayUtil`'s default by doing `ArrayUtil.grow(node)` and grow by 12.5%?
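   The compounding can be illustrated with plain arithmetic. The sketch below is an approximation: it models `ArrayUtil.grow`'s oversizing as adding one eighth of the requested size, whereas the real implementation also rounds for pointer alignment:

   ```java
   public class GrowthFactorSketch {
       public static void main(String[] args) {
           int size = 1024;
           int requested = (size + 1) * 3 / 2;            // caller asks for ~50% growth
           int oversized = requested + (requested >>> 3); // grow() adds ~12.5% on top
           // Combined growth is roughly 1.5 * 1.125 = 1.6875, i.e. ~68.75%.
           System.out.printf("requested=%d oversized=%d factor=%.4f%n",
                   requested, oversized, (double) oversized / size);
       }
   }
   ```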



##########
lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java:
##########
@@ -74,12 +74,8 @@ public void setup() {
     similarityFunction =
         VectorSimilarityFunction.values()[
             random().nextInt(VectorSimilarityFunction.values().length - 1) + 1];
-    if (similarityFunction == VectorSimilarityFunction.DOT_PRODUCT) {
-      vectorEncoding =
-          VectorEncoding.values()[random().nextInt(VectorEncoding.values().length - 1) + 1];
-    } else {
-      vectorEncoding = VectorEncoding.FLOAT32;
-    }
+    vectorEncoding =
+        VectorEncoding.values()[random().nextInt(VectorEncoding.values().length - 1) + 1];

Review Comment:
   I think you can do something like `RandomPicks.randomFrom(random(), VectorEncoding.values())` to make it more readable.





[GitHub] [lucene] msokolov commented on pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
msokolov commented on PR #11743:
URL: https://github.com/apache/lucene/pull/11743#issuecomment-1236890998

   > One thing I wonder about: we did not observe a longer total indexing time (combined indexing + refresh time). Did the combined indexing + refresh time become larger for you?
   
   The KnnGraphTester indexing-time numbers cover opening a new IndexWriter, adding all the documents, and closing the IndexWriter; there is no refresh.




[GitHub] [lucene] dweiss commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
dweiss commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r962606272


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java:
##########
@@ -74,12 +74,8 @@ public void setup() {
     similarityFunction =
         VectorSimilarityFunction.values()[
             random().nextInt(VectorSimilarityFunction.values().length - 1) + 1];
-    if (similarityFunction == VectorSimilarityFunction.DOT_PRODUCT) {
-      vectorEncoding =
-          VectorEncoding.values()[random().nextInt(VectorEncoding.values().length - 1) + 1];
-    } else {
-      vectorEncoding = VectorEncoding.FLOAT32;
-    }
+    vectorEncoding =
+        VectorEncoding.values()[random().nextInt(VectorEncoding.values().length - 1) + 1];

Review Comment:
   Or RandomizedTest.randomFrom(VectorEncoding.values()) - there should be a method that delegates to the current context.
   
   The above code seems to pick from the range [1, len); not sure if this is intentional, but index zero will never happen.
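   The off-by-one can be demonstrated with a small standalone sketch (a hypothetical two-constant enum stands in for `VectorEncoding`):

   ```java
   import java.util.Random;

   public class BiasedEnumPick {
       enum Encoding { BYTE, FLOAT32 } // stand-in for VectorEncoding

       public static void main(String[] args) {
           Random random = new Random();
           boolean sawIndexZero = false;
           for (int i = 0; i < 10_000; i++) {
               // nextInt(len - 1) yields [0, len - 1), so +1 shifts the range to [1, len):
               int idx = random.nextInt(Encoding.values().length - 1) + 1;
               if (idx == 0) sawIndexZero = true;
           }
           System.out.println("index 0 ever picked: " + sawIndexZero); // false
       }
   }
   ```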





[GitHub] [lucene] msokolov commented on pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
msokolov commented on PR #11743:
URL: https://github.com/apache/lucene/pull/11743#issuecomment-1236894645

   > I wonder if we should consider rolling back the "build graph during indexing" change? It seems to make indexing take > 10% longer and of course requires more RAM, which will tend to make more and smaller segments; not a desirable outcome.
   
   Although ... the RAM needed for the graph was always required, even when building the graph during flush; it just wasn't accounted for, I think. I suppose a possible way to improve the buffering situation would be to buffer the vectors in RAM and then, on merge, write them out, freeing the on-heap copy, and access the vectors from disk while building the graph.




[GitHub] [lucene] msokolov commented on pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
msokolov commented on PR #11743:
URL: https://github.com/apache/lucene/pull/11743#issuecomment-1238330802

   OK, thanks for the reminder of the arguments for moving the graph creation to index time.
   
   > Maybe we can come up with some sophisticated solution, writing vector values in batches to several files, but I am not sure this complexity is worth it.
   
   Right, we could buffer up to some % of the indexwriter buffer size in RAM, then write to a (list of) temporary file(s), freeing RAM and thenceforth accumulating new writes in RAM. Kind of like a pre-flush flush? Reading would require a wrapper that presents all of this as a single VectorValues. It is more complex, but it seems like it could be worthwhile, since it would help reduce the pressure on the index writer to flush "prematurely", and this HNSW stuff is sensitive to being fragmented.
   
   The current situation is not terrible; eventually, merging should improve the index geometry. I don't think we have a release blocker. At any rate, for the typical use cases I have in mind, the index size is still dominated by other types of fields, and this is unlikely to be a problem. A vectors-only index looks worse, but I think that exaggerates the typical impact? Not sure how it looks from other perspectives, though.




[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r966264433


##########
lucene/core/src/test/org/apache/lucene/util/TestRamUsageEstimator.java:
##########
@@ -222,6 +229,28 @@ public void testPrintValues() {
     System.out.println("LONG_SIZE = " + LONG_SIZE);
   }
 
+  public void testHnswGraph() throws IOException {
+    int size = atLeast(2000);
+    int dim = randomIntBetween(100, 1024);
+    int M = randomIntBetween(4, 96);
+    VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length - 1) + 1];
+    VectorEncoding vectorEncoding =
+        VectorEncoding.values()[random().nextInt(VectorEncoding.values().length - 1) + 1];
+    TestHnswGraph.RandomVectorValues vectors =
+        new TestHnswGraph.RandomVectorValues(size, dim, vectorEncoding, random());
+
+    HnswGraphBuilder<?> builder =
+        HnswGraphBuilder.create(
+            vectors, vectorEncoding, similarityFunction, M, M * 2, random().nextLong());
+    OnHeapHnswGraph hnsw = builder.build(vectors.copy());
+    long estimated = RamUsageEstimator.sizeOfObject(hnsw);
+    long actual = ramUsed(hnsw);
+
+    assertEquals((double) actual, (double) estimated, (double) actual * 0.3);
+  }
+

Review Comment:
   Addressed in 45ced1e5beac8eb9932933967f3a2ca00d0c7835





[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r966265177


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java:
##########
@@ -74,12 +74,8 @@ public void setup() {
     similarityFunction =
         VectorSimilarityFunction.values()[
             random().nextInt(VectorSimilarityFunction.values().length - 1) + 1];
-    if (similarityFunction == VectorSimilarityFunction.DOT_PRODUCT) {
-      vectorEncoding =
-          VectorEncoding.values()[random().nextInt(VectorEncoding.values().length - 1) + 1];
-    } else {
-      vectorEncoding = VectorEncoding.FLOAT32;
-    }
+    vectorEncoding =
+        VectorEncoding.values()[random().nextInt(VectorEncoding.values().length - 1) + 1];

Review Comment:
   Great feedback! Thanks Adrien & Dawid!  Addressed in 45ced1e5beac8eb9932933967f3a2ca00d0c7835





[GitHub] [lucene] jpountz commented on pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #11743:
URL: https://github.com/apache/lucene/pull/11743#issuecomment-1236975827

   I like the idea of exploring a combination of the current approach and on-disk buffering to flush less often.
   
   For the record, the approach of building the graph at flush time has a few other downsides that are not well captured by an indexing benchmark. Mike mentioned the fact that we use a similar amount of memory at flush time (though it's more transient), but there is also the stalling logic, which waits until flushing segments plus buffered segments use 2x the size of the RAM buffer before stalling indexing: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java#L286-L294 Because flushes take a very long time when the graph is built at flush time, it's more likely that IndexWriter goes over (up to 2x) the amount of RAM it's allowed to spend on the indexing buffer (which could be surprising to users on its own, and could cause OOMEs) and that indexing gets stalled (which can also be surprising to users). Maybe getting rid of these downsides is worth losing a bit of indexing throughput.
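   The stall rule described above can be sketched as follows. This is a hypothetical standalone illustration, not the actual `DocumentsWriterFlushControl` code; all names and the fixed 2x factor are assumptions drawn from the comment:

   ```java
   // Sketch of the stall rule: indexing threads are stalled only once memory
   // held by in-flight flushes plus still-buffered documents exceeds twice the
   // configured RAM buffer. Names are illustrative, not Lucene's.
   public class StallRuleSketch {
       /**
        * @param activeBytes bytes held by documents still buffered in RAM
        * @param flushBytes  bytes held by segments currently being flushed
        * @param ramBufferMB configured RAM buffer size in megabytes
        * @return whether indexing threads should be stalled
        */
       static boolean shouldStall(long activeBytes, long flushBytes, double ramBufferMB) {
           long limit = (long) (2 * ramBufferMB * 1024 * 1024);
           return activeBytes + flushBytes > limit;
       }

       public static void main(String[] args) {
           long mb = 1024 * 1024;
           // With a 16 MB buffer, a long-running flush holding 20 MB plus 13 MB of
           // newly buffered docs crosses the 32 MB (2x) threshold and stalls indexing.
           System.out.println(shouldStall(13 * mb, 20 * mb, 16.0)); // true
           System.out.println(shouldStall(10 * mb, 20 * mb, 16.0)); // false
       }
   }
   ```

   The slower a flush is, the longer `flushBytes` stays high while `activeBytes` keeps growing, which is why long graph-building flushes make hitting the 2x ceiling more likely.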
   
   
   




[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r962070814


##########
lucene/core/src/test/org/apache/lucene/util/TestRamUsageEstimator.java:
##########
@@ -222,6 +229,33 @@ public void testPrintValues() {
     System.out.println("LONG_SIZE = " + LONG_SIZE);
   }
 
+  public void testHnswGraph() throws IOException {
+    int size = atLeast(2000);
+    int dim = randomIntBetween(100, 1024);
+    int M = randomIntBetween(4, 96);
+    VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length - 1) + 1];
+    VectorEncoding vectorEncoding;
+    if (similarityFunction == VectorSimilarityFunction.DOT_PRODUCT) {

Review Comment:
   Thanks, good to know. I copied that part from another test in `TestHnswGraph`. I've corrected it here as well as in that test. Addressed in 2e01deab2e1678b4687250d22f62ad0cd6196bc9
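   As a rough illustration of what the RAM-estimation test above exercises: per node and per level, an on-heap HNSW graph stores a neighbor array of up to M entries (2*M on level 0) of int ordinals plus matching float scores. The sketch below is a back-of-envelope calculation only, not Lucene's actual `RamUsageEstimator` accounting; the 4-byte element sizes and ~16-byte array header are assumed JVM constants, and upper graph levels are ignored:

   ```java
   // Back-of-envelope memory sketch for a level-0-only HNSW graph.
   // Assumptions (illustrative, not measured): 4-byte ints/floats,
   // 16-byte array headers, 2*M neighbors per node on level 0.
   public class HnswMemorySketch {
       static long estimateBytes(int numNodes, int M) {
           long arrayHeader = 16; // assumed JVM array header size
           // per node: one int[] of ordinals + one float[] of scores, 2*M entries each
           long perNodeLevel0 = 2 * (arrayHeader + 2L * M * 4);
           return numNodes * perNodeLevel0; // upper levels ignored in this sketch
       }

       public static void main(String[] args) {
           // 1M nodes, M=16: 1M * 2 * (16 + 128) bytes = 288,000,000 bytes (~288 MB)
           System.out.println(estimateBytes(1_000_000, 16)); // prints 288000000
       }
   }
   ```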





[GitHub] [lucene] mayya-sharipova commented on pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #11743:
URL: https://github.com/apache/lucene/pull/11743#issuecomment-1238251543

   @jpountz  
   > could we buffer vectors on disk with the approach of building the graph during indexing?
   
   We explored this, but could not find a way to do it. To build the graph, we need access to all vector values indexed so far. If vectors are buffered in memory, this works. But if we buffer vectors on disk, we can't write vectors to that file and read them back during graph construction at the same time, since reading from an unclosed index output is not possible.
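   The access pattern that forces in-memory buffering can be illustrated with a minimal sketch: each newly inserted vector must read back arbitrary earlier vectors to find its neighbors. A growable in-memory list supports this trivially, whereas a file that is still being appended to (an unclosed IndexOutput, in Lucene terms) cannot be opened for reading. The class below is illustrative only, with a naive linear scan standing in for the graph search:

   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Sketch of the read-while-writing access pattern that graph construction
   // needs: random access to every vector added so far, interleaved with appends.
   public class InMemoryVectorBuffer {
       private final List<float[]> vectors = new ArrayList<>();

       /** Append a vector; returns its ordinal. */
       int add(float[] v) {
           vectors.add(v);
           return vectors.size() - 1;
       }

       /** Naive nearest-neighbor scan over everything added so far
        *  (a stand-in for the real HNSW search). */
       int nearest(float[] q) {
           int best = -1;
           double bestDist = Double.POSITIVE_INFINITY;
           for (int i = 0; i < vectors.size(); i++) {
               float[] v = vectors.get(i);
               double d = 0;
               for (int j = 0; j < q.length; j++) {
                   double diff = q[j] - v[j];
                   d += diff * diff;
               }
               if (d < bestDist) {
                   bestDist = d;
                   best = i;
               }
           }
           return best;
       }

       public static void main(String[] args) {
           InMemoryVectorBuffer buf = new InMemoryVectorBuffer();
           buf.add(new float[] {0f, 0f});
           buf.add(new float[] {1f, 1f});
           // Inserting {0.9, 0.9} requires reading back the earlier vectors:
           // its nearest neighbor is ordinal 1.
           System.out.println(buf.nearest(new float[] {0.9f, 0.9f})); // prints 1
       }
   }
   ```

   Replacing the `List<float[]>` with an append-only on-disk file would break `nearest`, since every insertion reads ordinals that were written moments earlier in the same pass.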

