You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/29 00:56:36 UTC

[GitHub] [lucene] jtibshirani opened a new issue, #11830: Store HNSW graph connections more compactly

jtibshirani opened a new issue, #11830:
URL: https://github.com/apache/lucene/issues/11830

   ### Description
   
   HNSW search is most efficient when all vector data fits in page cache. So good to keep the size of vector files as small as possible.
   
   We currently write all HNSW graph connections as fixed-size integers. This is wasteful since most graphs have far fewer nodes than the max integer value:
   https://github.com/apache/lucene/blob/d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956/lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java#L478
   
   Maybe instead we could store the connection list using `PackedInts.Writer`. This would decrease the bits needed to store each connection. We could still ensure that every connection list takes the same number of bytes, to continue being able to index into the graph data easily.
   
   I quickly tested a version of the idea on the 1 million vector GloVe dataset, and saw the graph data size decrease by ~30%:
   
   ```
   Baseline
   103M	luceneknn-100-16-100.train-16-100.index/_6_Lucene94HnswVectorsFormat_0.vex
   
   Hacky patch
   155M	luceneknn-100-16-100.train-16-100.index/_5_Lucene94HnswVectorsFormat_0.vex
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] benwtrent commented on issue #11830: Store HNSW graph connections more compactly

Posted by GitBox <gi...@apache.org>.

benwtrent commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1279207529

OK, I ran this patch against minst, sift, glove, and deep image. Recall is the exact same in all cases, so no mistakes there.

There are indeed slight differences in latency, below are images. I ran search 5 times for each parameter combination and chose the fastest. All of these were with `"M": 16, "efConstruction": 100`.

For each, I specifically point out the QPS at the lowest recall.

The deltas in QPS vary from near 0% change to 11% decrease. 11% does seem large but I think its an outlier. The median decrease is 2%, the average change is 0% (or 0.002%).
[ann_qps_comparison - ann_scores.csv](https://github.com/apache/lucene/files/9788126/ann_qps_comparison.-.ann_scores.csv)

For every data set, the memory required reduced at around a similar amount.

This change seems like a good step forward to me.

@jtibshirani

# Deep Image

![deep-image-96-angular-batch](https://user-images.githubusercontent.com/4357155/195885355-70fa0c17-23bb-4389-af70-f7c3e5668f64.png)

# Glove

![glove-100-angular-batch](https://user-images.githubusercontent.com/4357155/195885357-0776bc5e-ed4d-41af-9109-6a3764c3ba5d.png)

# Minst

![mnist-784-euclidean-batch](https://user-images.githubusercontent.com/4357155/195885361-c3267fd0-8a79-4184-953c-a6cc41f1f7dd.png)

# Sift

![sift-128-euclidean-batch](https://user-images.githubusercontent.com/4357155/195885362-fa71d343-f75f-4d51-9dfc-b202e75f319a.png)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on issue #11830: Store HNSW graph connections more compactly

Posted by GitBox <gi...@apache.org>.

msokolov commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1262415650

   I like the idea, although I confess I don't understand how we can make it fixed width! I guess if we know the max number and it is small, we can quantize more cheaply?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz closed issue #11830: Store HNSW graph connections more compactly

Posted by GitBox <gi...@apache.org>.

jpountz closed issue #11830: Store HNSW graph connections more compactly
URL: https://github.com/apache/lucene/issues/11830


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] benwtrent commented on issue #11830: Store HNSW graph connections more compactly

Posted by GitBox <gi...@apache.org>.

benwtrent commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1270205020

I was able to replicate @jtibshirani results.

Using the `glove-100-angular` from ann-benchmarks, here is the graph data size:

```
Baseline:
154M _0_Lucene94HnswVectorsFormat_0.vex

PackedInts:
103M _0_Lucene94HnswVectorsFormat_0.vex
```

Now, the implementation for `PackedInts` followed closely to @jtibshirani "hacky patch". There are various changes required on read, and new information needs to be stored in regards to the PackedInts version used. In all (disregarding the fact that a new codec must be written), the code change is pretty small.

Other good news is that when running ann-benchmark on `glove-100-angular`, there is no discernible difference in search performance. I am not sure how much traffic the off-heap reader gets in this benchmark. It seems to be it would be read once and stored in memory, though I might be misunderstanding something.

@jtibshirani @mayya-sharipova what do y'all think? Any ideas on how to force off-heap for search for better numbers on search changes?

Honestly, we may not care about slightly slower off-heap search performance as smaller sizes are more likely to be in the page cache.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on issue #11830: Store HNSW graph connections more compactly

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1270359029

   @benwtrent thanks for looking into this! To clarify, `OffHeapHnswGraph` will always be used for searches. Usually it will be paged in from disk before heavy searching, but it's still the same data structure. The on-heap version is only used for indexing.
   
   Could you include a comparison of the search latency and recall numbers you see with this approach? Sometimes with our benchmarks it's easy to miss small differences in performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] benwtrent commented on issue #11830: Store HNSW graph connections more compactly

Posted by GitBox <gi...@apache.org>.

benwtrent commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1319099313

   I changed the PR to move towards delta encoding & vint. Even with storing the memory offsets within `vex`, the storage improvements are much better than PackedInts.
   
   Table with some numbers around the size improvements for different data sets & parameters:
   
   | packed_vex_mb_size | vex_mb_size | packed_index_build_time | index_build_time | params                             | dataset             | percent_reduction |
   |--------------------|-------------|-------------------------|------------------|------------------------------------|---------------------|-------------------|
   | 79.9               | 161.6       | 767                     | 784              | "{'M': 16, 'efConstruction': 100}" | glove-100-angular   | 50.55693069       |
   | 108.4              | 464.1       | 1138                    | 1225             | "{'M': 48, 'efConstruction': 100}" | glove-100-angular   | 76.64296488       |
   | 2.3                | 8.2         | 36                      | 36               | "{'M': 16, 'efConstruction': 100}" | mnist-784-euclidean | 71.95121951       |
   | 2.4                | 23.5        | 36                      | 36               | "{'M': 48, 'efConstruction': 100}" | mnist-784-euclidean | 89.78723404       |
   | 66.1               | 392.2       | 501                     | 572              | "{'M': 48, 'efConstruction': 100}" | sift-128-euclidean  | 83.1463539        |
   | 59.7               | 136.6       | 449                     | 516              | "{'M': 16, 'efConstruction': 100}" | sift-128-euclidean  | 56.29575403       |
   
   
   For the curious, here are the QPS numbers (higher is better) for packed (delta & vint) vs baseline:
   
   # Glove
   
   ![image](https://user-images.githubusercontent.com/4357155/202539450-415f1622-cf6f-4cc6-8de5-e714b47cc8a6.png)
   
   # MNist
   
   ![image](https://user-images.githubusercontent.com/4357155/202539516-235485b1-9b01-497f-81af-ce2d7475ae74.png)
   
   # SIFT
   
   ![image](https://user-images.githubusercontent.com/4357155/202539592-d3c387e2-60e2-4956-8e92-b5b9361588bb.png)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on issue #11830: Store HNSW graph connections more compactly

Posted by GitBox <gi...@apache.org>.

msokolov commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1320183235

   Hey this looks great! Awesome to see the storage gains with no loss in
   query time
   
   On Thu, Nov 17, 2022 at 2:25 PM Benjamin Trent ***@***.***>
   wrote:
   
   > I changed the PR to move towards delta encoding & vint. Even with storing
   > the memory offsets within vex, the storage improvements are much better
   > than PackedInts.
   >
   > Table with some numbers around the size improvements for different data
   > sets & parameters:
   > packed_vex_mb_size vex_mb_size packed_index_build_time index_build_time
   > params dataset percent_reduction
   > 79.9 161.6 767 784 "{'M': 16, 'efConstruction': 100}" glove-100-angular
   > 50.55693069
   > 108.4 464.1 1138 1225 "{'M': 48, 'efConstruction': 100}" glove-100-angular
   > 76.64296488
   > 2.3 8.2 36 36 "{'M': 16, 'efConstruction': 100}" mnist-784-euclidean
   > 71.95121951
   > 2.4 23.5 36 36 "{'M': 48, 'efConstruction': 100}" mnist-784-euclidean
   > 89.78723404
   > 66.1 392.2 501 572 "{'M': 48, 'efConstruction': 100}" sift-128-euclidean
   > 83.1463539
   > 59.7 136.6 449 516 "{'M': 16, 'efConstruction': 100}" sift-128-euclidean
   > 56.29575403
   >
   > For the curious, here are the QPS numbers (higher is better) for packed
   > (delta & vint) vs baseline:
   > Glove
   >
   > [image: image]
   > <https://user-images.githubusercontent.com/4357155/202539450-415f1622-cf6f-4cc6-8de5-e714b47cc8a6.png>
   > MNist
   >
   > [image: image]
   > <https://user-images.githubusercontent.com/4357155/202539516-235485b1-9b01-497f-81af-ce2d7475ae74.png>
   > SIFT
   >
   > [image: image]
   > <https://user-images.githubusercontent.com/4357155/202539592-d3c387e2-60e2-4956-8e92-b5b9361588bb.png>
   >
   > —
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/lucene/issues/11830#issuecomment-1319099313>,
   > or unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAHHUQO5Y2R3FCC3FD6TM4TWI2BD7ANCNFSM6AAAAAAQYJCK7E>
   > .
   > You are receiving this because you commented.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on issue #11830: Store HNSW graph connections more compactly

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1262311806

   Nice idea, @jtibshirani! 
   Have you tested what's the performance of reading this way packed values during search?  Does it make searches any slower? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] benwtrent commented on issue #11830: Store HNSW graph connections more compactly

Posted by GitBox <gi...@apache.org>.

benwtrent commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1270415680

   > Could you include a comparison of the search latency and recall numbers you see with this approach? Sometimes with our benchmarks it's easy to miss small differences in performance.
   
   Yes, I can provide some images and numbers. What I saw when running `glove-100-angular` batched is that there is no difference. Will run more tests for sure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on issue #11830: Store HNSW graph connections more compactly

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1279577436

   Thanks @benwtrent , the results look promising! It seems like a good time to open a PR. We might even have ideas for optimizations once we see how the change looks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org