Posted to issues@lucene.apache.org by "nknize (via GitHub)" <gi...@apache.org> on 2023/05/15 20:25:13 UTC

[GitHub] [lucene] nknize commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

nknize commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1548525928

> ...why is it then that GPT-4, which internally represents each token with a vector of more than 8192, still inaccurately recalls information about entities?

I think this comment actually supports @MarcusSorealheis's argument? That is, what's the point in indexing 8K dimensions if it isn't much better at recall than 768?

> If the real issue is with the use of HNSW, which isn't suitable for this, not that highe-dimensionality embeddings have value, then the solution isn't to not provide the feature, but to switch technologies to something more suitable for the type of applications that people use Lucene for.

I may be wrong, but it seems like this is where most of the Lucene committers here are settling?

Over a decade ago I wanted a high-dimension index for some facial recognition and surveillance applications I was working on. I rejected Lucene at first only because it is written in Java, and I personally felt something like C++ was a better fit for the high-dimension job (no garbage collection to worry about). So I wrote a high-dimension indexer for [MongoDB](https://jira.mongodb.org/browse/SERVER-3551) inspired by RTree (for the record, its implementation is based on XTree) and wrote it using C++ 14 preview features (lambda functions were the new hotness on the block and Java didn't even have them yet). Even in C++ back then, SIMD wasn't very well supported natively by the compiler, so I had to add all sorts of compiler tricks to squeeze out every ounce of vector parallelization and make it performant. C++ has gotten better since then, but I think Java still lags in this area? Even JEP 426 is a ways off (maybe because OpenJDK is holding this hostage)? So maybe Java is still not the right fit here? I wonder though, does that mean Lucene shouldn't provide dimensionality higher than an arbitrary 1024? Maybe not. I agree dimensionality reduction techniques like PCA should be considered to reduce the storage volume. The problem with that argument is that dimensionality reduction fails when features are weakly correlated: you can't capture the majority of the signal in the first N components and therefore need higher dimensionality. But does that mean that 1024 is still too low to make Lucene a viable option?
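
To put rough numbers on the PCA point, here is a toy sketch, not a measurement on real embeddings. It assumes `d` unit-variance features that all share one pairwise correlation `rho`, a deliberately simplified model chosen only because its PCA spectrum has a closed form: one eigenvalue equal to `1 + (d - 1) * rho` and `d - 1` eigenvalues equal to `1 - rho`. The function name and the parameters `d`, `rho`, and `n` are made up for illustration.

```python
def explained_variance_fraction(d: int, rho: float, n: int) -> float:
    """Fraction of total variance kept by the top-n principal components.

    Toy model (an assumption, not real embedding data): d unit-variance
    features with the same pairwise correlation rho. The correlation
    matrix then has one eigenvalue 1 + (d - 1) * rho and d - 1 eigenvalues
    equal to 1 - rho, and the eigenvalues sum to d.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    if not 1 <= n <= d:
        raise ValueError("n must be in [1, d]")
    largest = 1.0 + (d - 1) * rho   # the single dominant component
    remaining = 1.0 - rho           # each of the other d - 1 components
    return (largest + (n - 1) * remaining) / d


if __name__ == "__main__":
    # Reduce 1024 dimensions to 256 under different correlation levels.
    for rho in (0.0, 0.1, 0.5, 0.9):
        kept = explained_variance_fraction(1024, rho, 256)
        print(f"rho={rho:.1f}: top 256 of 1024 components keep {kept:.3f} of the variance")
```

In this model, fully uncorrelated features (`rho = 0`) leave the top 256 components with exactly a quarter of the variance, so truncation buys nothing beyond dropping dimensions at random, while strongly correlated features concentrate most of the variance in the first component. Real embedding spectra will look different, but the direction of the effect is the same.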

Aside from conjecture, does anyone have empirical examples where 1024 is too low, and which specific Lucene capabilities (e.g., scoring?) would make adding support for dimensions higher than 1024 really worth considering? If Lucene doesn't do this, does it really risk the project becoming irrelevant? That sounds a bit like sensationalism. Even if higher dimensionality is added to the current vector implementation (I'd actually argue we should explore converting BKD to support higher dimensions instead), are we convinced it will perform without JEP 426 or the better SIMD support that's only available in newer JDKs? I know Pinecone (and others) [have blogged about their love for Rust](https://www.pinecone.io/learn/rust-rewrite/) for these kinds of applications. Should Lucene just leave this job to alternative search APIs? Maybe even something like Tantivy or Rucene?

Interested in what others think.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
