Posted to issues@lucene.apache.org by "ashvardanian (via GitHub)" <gi...@apache.org> on 2023/08/11 16:33:02 UTC

[GitHub] [lucene] ashvardanian opened a new issue, #12502: USearch integration and potential Vector Search performance improvements

ashvardanian opened a new issue, #12502:
URL: https://github.com/apache/lucene/issues/12502

   ### Description
   
   I was recently approached by Lucene and Elastic users who face low performance and high memory consumption when running Vector Search tasks on the JVM. Some have also been using native libraries, like our [USearch](https://github.com/unum-cloud/usearch), and were curious whether those systems can be combined. Hence, here I am, excited to open a discussion 🤗 
   
   cc @jbellis, @benwtrent, @alessandrobenedetti, @msokolov
   
   ---
   
   I have looked into the existing HNSW implementation and the related PR, #10047. The integration should be simple, given that [we already have a JNI binding that passes CI and is hosted on GitHub](https://github.com/unum-cloud/usearch/packages/1867475). The upside would be:
   
   - the performance won't be just on par with FAISS but can be higher.
   - cross-platform `f16` support and `i8` optional automatic downcasting.
   - indexes can be memory-mapped from disk without loading into RAM and are about to receive many `io_uring`-based kernel-bypass tricks, similar to what we have in [UCall](https://github.com/unum-cloud/ucall).
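
To make the `i8` downcasting item concrete, here is a minimal Java sketch of scalar quantization from `f32` to `i8`. It assumes every vector component lies in [-1, 1], as for unit-length vectors, and scales by 127. The class and method names are hypothetical, and this is a generic illustration rather than USearch's actual scheme.

```java
// Hypothetical illustration of f32 -> i8 downcasting; not USearch's actual code.
final class Int8Downcast {
    // Assumes every component lies in [-1, 1]; values outside are clamped.
    static byte[] quantize(float[] vector) {
        byte[] out = new byte[vector.length];
        for (int i = 0; i < vector.length; i++) {
            float clamped = Math.max(-1f, Math.min(1f, vector[i]));
            out[i] = (byte) Math.round(clamped * 127f);
        }
        return out;
    }

    // Integer dot product; divide by 127 * 127 to approximate the f32 result.
    static int dot(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {0.6f, 0.8f};
        float[] b = {1.0f, 0.0f};
        int raw = dot(quantize(a), quantize(b));
        // The exact f32 dot product is 0.6; the quantized estimate is close to it.
        System.out.println(raw / (127f * 127f));
    }
}
```

Each component then takes one byte instead of four, at the cost of a small quantization error in the computed distances.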
   
   ---
   
   This may automatically resolve the following issues (in reverse chronological order):
   
   - [x] half-precision support: #12403
   - [x] multi-key support: #12313 
   - [x] pluggable metrics, similar to our JIT support in Python: #12219
   - [x] 2K+ dimensional vectors: #11507
   - [x] compact offsets with `uint40_t`: #10884
   - [x] memory consumption: #10177
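
The `uint40_t` item refers to storing each offset in 5 bytes rather than 8. A minimal Java sketch of the idea follows; the class name and the little-endian layout are assumptions made for illustration and do not describe the on-disk format of USearch or Lucene.

```java
// Hypothetical sketch: 40-bit unsigned offsets packed 5 bytes apiece.
final class PackedUint40 {
    static final long MAX_VALUE = (1L << 40) - 1;
    private final byte[] data;

    PackedUint40(int count) {
        data = new byte[Math.multiplyExact(count, 5)];
    }

    void set(int index, long value) {
        if (value < 0 || value > MAX_VALUE) {
            throw new IllegalArgumentException("value does not fit in 40 bits: " + value);
        }
        int base = index * 5;
        for (int i = 0; i < 5; i++) {
            data[base + i] = (byte) (value >>> (8 * i)); // little-endian byte order
        }
    }

    long get(int index) {
        int base = index * 5;
        long value = 0;
        for (int i = 0; i < 5; i++) {
            value |= (data[base + i] & 0xFFL) << (8 * i);
        }
        return value;
    }

    int sizeInBytes() {
        return data.length;
    }
}
```

Forty bits can address up to 1 TiB of data, and each stored offset takes 37.5% less space than a 64-bit `long`.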
   
   ---
   
   As far as I understand, it is not common to integrate Lucene with native libraries, but it seems justifiable for such computationally intensive workloads. 
   
   |              | FAISS, `f32` | USearch, `f32` | USearch, `f16` |     USearch, `i8` |
   | :----------- | -----------: | -------------: | -------------: | ----------------: |
   | Batch Insert |       16 K/s |         73 K/s |        100 K/s | 104 K/s **+550%** |
   | Batch Search |       82 K/s |        103 K/s |        113 K/s |  134 K/s **+63%** |
   | Bulk Insert  |       76 K/s |        105 K/s |        115 K/s | 202 K/s **+165%** |
   | Bulk Search  |      118 K/s |        174 K/s |        173 K/s | 304 K/s **+157%** |
   | Recall @ 10  |          99% |          99.2% |          99.1% |             99.2% |
   
   > Dataset: 1M vectors sample of the Deep1B dataset. Hardware: `c7g.metal` AWS instance with 64 cores and DDR5 memory. HNSW was configured with identical hyper-parameters: connectivity `M=16`, expansion @ construction `efConstruction=128`, and expansion @ search `ef=64`. Batch size is 256. Both libraries were compiled for the target architecture.
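
The bold percentages in the table appear to be the relative gain of the USearch `i8` column over the FAISS `f32` column, truncated to a whole percent. A short Java check under that assumption, with a hypothetical helper name:

```java
// Hypothetical helper for re-deriving the bold percentages in the table.
final class RelativeGain {
    // Gain of `candidate` over `baseline` in percent, truncated toward zero.
    // Both arguments are throughputs in K/s, taken from the table.
    static long percentGain(long baseline, long candidate) {
        return (candidate - baseline) * 100 / baseline;
    }

    public static void main(String[] args) {
        System.out.println(percentGain(16, 104));  // Batch Insert row
        System.out.println(percentGain(82, 134));  // Batch Search row
        System.out.println(percentGain(76, 202));  // Bulk Insert row
        System.out.println(percentGain(118, 304)); // Bulk Search row
    }
}
```

Under this reading the four rows reproduce the table's 550, 63, 165, and 157. Rounding to the nearest integer instead would give 166 and 158 for the two Bulk rows.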
   
   I am happy to contribute, and looking forward to your comments 🤗


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] ashvardanian commented on issue #12502: USearch integration and potential Vector Search performance improvements

Posted by "ashvardanian (via GitHub)" <gi...@apache.org>.
ashvardanian commented on issue #12502:
URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675201748

   Thank you, @benwtrent, @jbellis, and @uschindler! It's very insightful! [Nmslib.java](https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/util/Nmslib.java) seems like the right place to start.


[GitHub] [lucene] uschindler commented on issue #12502: USearch integration and potential Vector Search performance improvements

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on issue #12502:
URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675136975

   FYI, if you want to see how others implemented native kNN search using JNI in custom codecs, look here: https://github.com/opensearch-project/k-NN/tree/main/src/main/java/org/opensearch/knn/index (they support faiss and nmslib).
   
   The problem is that you also need to work around non-standard Lucene segments, because the legacy formats do not follow the Lucene file format conventions and WORM (write-once, read-many) files. All I/O needs to go through the Lucene I/O layers; if that does not work, you need workarounds. Those should not live inside Lucene.


[GitHub] [lucene] uschindler commented on issue #12502: USearch integration and potential Vector Search performance improvements

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on issue #12502:
URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675084211

   Yes:
   - no external libraries for Lucene Core
   - no native code


[GitHub] [lucene] benwtrent commented on issue #12502: USearch integration and potential Vector Search performance improvements

Posted by "benwtrent (via GitHub)" <gi...@apache.org>.
benwtrent commented on issue #12502:
URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675079917

   I don't think we need a native implementation. JNI stuff can be dangerous. I honestly don't know the history around Lucene and if there have ever been considerations in the area before. 
   
   I think we should work on making vector search better in Java. We have yet to hit the ceiling here in vector search & index performance in Java and Lucene.
   
   @uschindler what do you think?


Re: [I] USearch integration and potential Vector Search performance improvements [lucene]

Posted by "chadbrewbaker (via GitHub)" <gi...@apache.org>.
chadbrewbaker commented on issue #12502:
URL: https://github.com/apache/lucene/issues/12502#issuecomment-1813112623

   > Yes:
   > 
   > * no external libraries for Lucene Core
   > * no native code
   
   Put it in an "examples" directory to show how to extend Lucene with JNI. If you have a $1m spend on Lucene you will figure out JNI issues. As accelerators pop up you will also likely want MOJO native drivers.


[GitHub] [lucene] jbellis commented on issue #12502: USearch integration and potential Vector Search performance improvements

Posted by "jbellis (via GitHub)" <gi...@apache.org>.
jbellis commented on issue #12502:
URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675081860

   Hi Ash,
   
   (1) Have you compared usearch directly with Lucene?  This could be a useful starting point: https://github.com/jbellis/hnswrecall
   
   (2) My understanding is that it is a design goal for Lucene to have zero external dependencies at all, but I'm not a committer so hopefully others will chime in.


[GitHub] [lucene] uschindler commented on issue #12502: USearch integration and potential Vector Search performance improvements

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on issue #12502:
URL: https://github.com/apache/lucene/issues/12502#issuecomment-1675089656

   If you want to integrate it with Lucene:
   - Write your own codec and `KnnVectorsFormat` that uses your library and keep it as a separate project. It is easy to plug in using SPI
   - Don't use JNI and instead use Panama FFI (Java 19+)

