Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/05/09 15:17:47 UTC

[GitHub] [lucene] mayya-sharipova opened a new pull request, #874: LUCENE-10471 Increse max dims for vectors to 2048

mayya-sharipova opened a new pull request, #874:
URL: https://github.com/apache/lucene/pull/874

   Increase the maximum number of dims for KNN vectors to 2048.
   
   The current maximum allowed number of dimensions is 1024.
   But in practice we see a number of models that produce vectors with more than 1024
   dimensions, especially for image encoding (e.g. mobilenet_v2 uses
   1280d vectors, and OpenAI / GPT-3 Babbage uses 2048d vectors).
   Increasing the max dims to `2048` will satisfy these use cases.
   
   We will not recommend further increase of vector dims.
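
   As a rough sketch of what the image-encoding use case above looks like on the indexing side (the field name, index path, and similarity function below are illustrative, not part of this change), a 1280d mobilenet_v2 embedding would be indexed like this; with the current 1024-dimension cap this vector is rejected, and raising the cap to 2048 makes it legal:

   ```java
   import java.nio.file.Paths;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.KnnVectorField;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.index.VectorSimilarityFunction;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.FSDirectory;

   public class IndexImageVector {
     public static void main(String[] args) throws Exception {
       try (Directory dir = FSDirectory.open(Paths.get("/tmp/knn-index"));
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
         // In practice this comes from the image encoder; 1280 dims exceeds the current 1024 limit.
         float[] embedding = new float[1280];
         Document doc = new Document();
         doc.add(new KnnVectorField("image_vector", embedding, VectorSimilarityFunction.EUCLIDEAN));
         writer.addDocument(doc);
       }
     }
   }
   ```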
   




[GitHub] [lucene] MarcusSorealheis commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

Posted by GitBox <gi...@apache.org>.
MarcusSorealheis commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1286509849

   Should we punish and exclude customers who cannot complete the requisite steps of dimensionality reduction, or should we allow them to explore with very expensive compute? Many popular large language models surpass the current threshold, for better or worse.




[GitHub] [lucene] rmuir commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1122395334

   My concerns are on the JIRA issue; I don't want them to be forgotten. https://issues.apache.org/jira/browse/LUCENE-10471
   
   I don't know how we can say "we will not recommend further increase". What happens when the latest trendy dataset comes out with 4096 dimensions?
   
   I want to understand why so many dimensions are really needed for search purposes. What is the concrete benefit in terms of quality, given that we know what the performance hit is going to be?




[GitHub] [lucene] veqtor commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

Posted by "veqtor (via GitHub)" <gi...@apache.org>.
veqtor commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1517620551

   > willing to take actions that go against science because vendors have told them it is right
   
   If, as you say, an entire document, regardless of its length, content, and so on, can be represented by a vector of 768 floats, why is it then that GPT-4, which internally represents each token with a vector of more than 8192 dimensions, still inaccurately recalls information about entities?
   
   Do you see the flaw in your reasoning here?
   
   If the real issue is with the use of HNSW, which isn't suitable for this, and not whether high-dimensionality embeddings have value, then the solution isn't to withhold the feature, but to switch to a technology more suitable for the type of applications that people use Lucene for: search over large amounts of data.
   
   As it stands, if you need this functionality you have no reason to use anything other than FAISS.
   HNSW works OK, but only for up to 500 or so embeddings; after that it becomes too slow.
   Using FAISS you can hierarchically partition the vector space, and all calculations are done efficiently.
   
   If bringing in FAISS is too drastic, then its implementation should be studied and integrated instead.
   
   Fast, efficient vector functionality is a must; if Lucene doesn't support this, then it and anything that builds on it is doomed.




[GitHub] [lucene] uschindler commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

Posted by GitBox <gi...@apache.org>.
uschindler commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1295866157

   Please don't do this. If somebody is not able to reduce the number of dimensions before indexing, they should not use vector search at all, because it will just produce huge indexes that are slow as hell. If you understand your data, you can also reduce its dimensions. If not, this is the wrong tool for you.




[GitHub] [lucene] mocobeta commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

Posted by GitBox <gi...@apache.org>.
mocobeta commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1121814970

   I'm curious how practically common such large models (large to me, at least) are now, or will be in the near future, in the IR area.
   I don't have enough expertise to agree or disagree; it's just a general (and maybe naive) question.




[GitHub] [lucene] nknize commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

Posted by "nknize (via GitHub)" <gi...@apache.org>.
nknize commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1548525928

   
   > ...why is it then that GPT-4, which internally represents each token with a vector of more than 8192 dimensions, still inaccurately recalls information about entities?
   
   I think this comment actually supports @MarcusSorealheis's argument? e.g., what's the point of indexing 8K dimensions if it isn't much better at recall than 768?
   
   
   > If the real issue is with the use of HNSW, which isn't suitable for this, and not whether high-dimensionality embeddings have value, then the solution isn't to withhold the feature, but to switch to a technology more suitable for the type of applications that people use Lucene for.
   
   I may be wrong, but it seems like this is where most of the Lucene committers here are settling?
   
   Over a decade ago I wanted a high-dimension index for some facial recognition and surveillance applications I was working on. I rejected Lucene at first only because it is written in Java, and I personally felt something like C++ was a better fit for the high-dimension job (no garbage collection to worry about). So I wrote a high-dimension indexer for [MongoDB](https://jira.mongodb.org/browse/SERVER-3551) inspired by RTree (for the record, its implementation is based on XTree) and wrote it using C++14 preview features (lambda functions were the new hotness on the block and Java didn't even have them yet). Even in C++ back then, SIMD wasn't very well supported natively by the compiler, so I had to add all sorts of compiler tricks to squeeze out every ounce of vector parallelization to make it performant. C++ has gotten better since then, but I think Java still lags in this area? Even JEP 426 is a ways off (maybe because OpenJDK is holding it hostage)? So maybe Java is still not the right fit here? I wonder though, does that mean Lucene shouldn't provide dimensionality higher than an arbitrary 1024? Maybe not. I agree dimensionality reduction techniques like PCA should be considered to reduce the storage volume. The problem with that argument is that dimensionality reduction fails when features are weakly correlated: you can't capture the majority of the signal in the first N components, and therefore you need higher dimensionality. But does that mean that 1024 is still too low to make Lucene a viable option?
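
   To make the weak-correlation point concrete, here is a minimal sketch (using Apache Commons Math purely for illustration, not anything in Lucene) of the check I have in mind: if the largest k eigenvalues of the embedding covariance matrix explain only a small fraction of the total variance, reducing to k dimensions with PCA throws away most of the signal:

   ```java
   import java.util.Arrays;
   import org.apache.commons.math3.linear.EigenDecomposition;
   import org.apache.commons.math3.linear.RealMatrix;
   import org.apache.commons.math3.stat.correlation.Covariance;

   public class PcaCheck {
     /** Fraction of total variance captured by the k largest principal components. */
     static double explainedVariance(double[][] embeddings, int k) {
       RealMatrix cov = new Covariance(embeddings).getCovarianceMatrix();
       double[] eig = new EigenDecomposition(cov).getRealEigenvalues();
       Arrays.sort(eig); // ascending; the largest eigenvalues end up at the tail
       double total = 0, topK = 0;
       for (double v : eig) total += v;
       for (int i = 0; i < k; i++) topK += eig[eig.length - 1 - i];
       return topK / total;
     }
   }
   ```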
   
   Aside from conjecture, does anyone have empirical examples where 1024 is too low, and what specific Lucene capabilities (e.g., scoring?) would make adding support for dimensions higher than 1024 really worth considering? If Lucene doesn't do this, does it really risk the project becoming irrelevant? That sounds a bit like sensationalism. Even if higher dimensionality is added to the current vector implementation (I'd actually argue we should explore converting BKD to support higher dimensions instead), are we convinced it will perform without JEP 426 or the better SIMD support that's only available in newer JDKs? I know Pinecone (and others) [have blogged about their love for Rust](https://www.pinecone.io/learn/rust-rewrite/) for these kinds of applications. Should Lucene just leave this job to alternative search APIs? Maybe even something like Tantivy or Rucene?
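
   For what it's worth, here is roughly what the incubating JEP 426 Vector API gives you for the dot-product hot loop (my own sketch, not Lucene code; it needs a recent JDK run with `--add-modules jdk.incubator.vector`):

   ```java
   import jdk.incubator.vector.FloatVector;
   import jdk.incubator.vector.VectorOperators;
   import jdk.incubator.vector.VectorSpecies;

   public class SimdDot {
     static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

     /** Dot product over the widest SIMD lanes the CPU supports, with a scalar tail. */
     static float dot(float[] a, float[] b) {
       FloatVector acc = FloatVector.zero(SPECIES);
       int i = 0;
       int bound = SPECIES.loopBound(a.length);
       for (; i < bound; i += SPECIES.length()) {
         FloatVector va = FloatVector.fromArray(SPECIES, a, i);
         FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
         acc = va.fma(vb, acc); // lane-wise a * b + acc
       }
       float sum = acc.reduceLanes(VectorOperators.ADD);
       for (; i < a.length; i++) {
         sum += a[i] * b[i]; // remaining elements past the last full vector
       }
       return sum;
     }
   }
   ```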
   
   Interested what others think.
   
   




[GitHub] [lucene] MarcusSorealheis commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

Posted by GitBox <gi...@apache.org>.
MarcusSorealheis commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1295871431

   Neither of you is wrong. In this case, we have a world of people excited about a new thing, willing to take actions that go against science because vendors have told them it is right. While I am personally confident that the number of dimensions useful for the search use case ought not to exceed 768, the hard and fast rule boxes us out of a fabulous amount of exploratory compute. I never want Lucene to be perceived as legacy software.
   
   
   On this point, I will stand down, especially because users can make the change themselves if they want; we’re open source. Reliability and performance of the unchanged system are more important.




[GitHub] [lucene] rmuir commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1295819265

   I don't agree; I think the problems are flaws in HNSW and can't be worked around. It's too slow already at 768 dimensions, and in fact the current limit overpromises and underdelivers by even allowing you to do this.
   
   




[GitHub] [lucene] MarcusSorealheis commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

Posted by GitBox <gi...@apache.org>.
MarcusSorealheis commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1295674939

   I think slow indexing throughput is a pain that customers ought to surface. If they find that they mostly use vectors for use cases that don't have NRT-scaling and replication requirements, that should drive our decision on whether to limit the maximum number of dimensions.
   
   I have seen multiple OpenAI and Hugging Face customers flock to other search engines because we impose this limit. 4096 is the number that keeps getting thrown around, but I have seen one case of more.
   
   On the other hand, if there are stability concerns at a particular level of dimensionality, we should cap it there. Not all customers have equivalent needs for indexing throughput.
   
   Plus, we can work on indexing throughput in the future as an incremental improvement to the feature.




[GitHub] [lucene] rmuir commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1286914129

   The performance with e.g. 768 dimensions is incredibly painful: hours and hours to index just 1M documents. It already doesn't scale with the current limit!




[GitHub] [lucene] ctlgcustodio commented on pull request #874: LUCENE-10471 Increse max dims for vectors to 2048

Posted by GitBox <gi...@apache.org>.
ctlgcustodio commented on PR #874:
URL: https://github.com/apache/lucene/pull/874#issuecomment-1131711152

   > My concerns are on the JIRA issue; I don't want them to be forgotten. https://issues.apache.org/jira/browse/LUCENE-10471
   > 
   > I don't know how we can say "we will not recommend further increase". What happens when the latest trendy dataset comes out with 4096 dimensions?
   > 
   > I want to understand why so many dimensions are really needed for search purposes. What is the concrete benefit in terms of quality, given that we know what the performance hit is going to be?
   
   
   I understand that, in general, the more features you have in an embedding vector, the more detail the model returns from the classification, so you get a more refined result.
   However, while Lucene does not support more than 1024 dimensions, if possible use a weighted average and evaluate your result.
   
   In my case I used a fixed average and it worked fine for the ELMo model, as described in Section 3, Alternative Weighting Schemes: https://arxiv.org/pdf/1904.02954.pdf
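
   To illustrate what I mean by a fixed (unweighted) average (my own sketch, not code from the paper): pooling the per-token vectors keeps the document vector at the model's native dimensionality, so it stays within the current 1024 limit:

   ```java
   public class AveragePooling {
     /** Pool per-token embeddings into a single document vector by unweighted averaging. */
     static float[] averagePool(float[][] tokenEmbeddings) {
       int dims = tokenEmbeddings[0].length;
       float[] doc = new float[dims];
       for (float[] token : tokenEmbeddings) {
         for (int d = 0; d < dims; d++) {
           doc[d] += token[d];
         }
       }
       for (int d = 0; d < dims; d++) {
         doc[d] /= tokenEmbeddings.length;
       }
       return doc;
     }
   }
   ```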
   
   [Another option](https://github.com/lior-k/fast-elasticsearch-vector-scoring): if I'm not mistaken, this repository is capable of supporting vectors larger than 1024 dimensions.

