Posted to issues@lucene.apache.org by "alessandrobenedetti (via GitHub)" <gi...@apache.org> on 2023/05/19 15:30:04 UTC

[GitHub] [lucene] alessandrobenedetti opened a new issue, #12313: Multi-value Support for KnnVectorField

alessandrobenedetti opened a new issue, #12313:
URL: https://github.com/apache/lucene/issues/12313

   ### Description
   
   It would be nice to support multiple values in a Knn vector field.
   This must be compatible with both the Exact and Approximate Nearest Neighbor search.
   
   There are two sides to the coin:
   
   1) Index time support - allowing multiple vectors for the same field and docID to be added to the indexing data structures
   2) Query time support - how to retrieve a topK list of documents, where each document may have multiple neighbors to the query
   
   The problem is more complicated than it seems.
   
   An initial tentative design and draft implementation are attached.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] HoustonPutman commented on issue #12313: Multi-value Support for KnnVectorField

Posted by "HoustonPutman (via GitHub)" <gi...@apache.org>.

HoustonPutman commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-1611546574

   @alessandrobenedetti's [Berlin Buzzwords talk](https://www.youtube.com/watch?v=KhL0NrGj0uE) gave a pretty good example. If you want to have individual vectors for each paragraph, then you would either need to split your document up per-paragraph, or you would need a multi-valued field for your paragraph vectors.
   
   But I can imagine other usages as well. Some users store information about many people (names, emails, etc.) in each document to represent a group. If you have vectorized each person for personalization, then that document would also need a vector per person.



[GitHub] [lucene] msokolov commented on issue #12313: Multi-value Support for KnnVectorField

Posted by "msokolov (via GitHub)" <gi...@apache.org>.

msokolov commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-1611270462

   I see a lot of good work on the implementation in the attached PR, great! What I'm lacking though is any understanding of what the use cases for this might be. Do we have some? I think it's important to have at least some envisioned so we can know how to go with the implementation since there will undoubtedly be tradeoffs.



Re: [I] Multi-value Support for KnnVectorField [lucene]

Posted by "vigyasharma (via GitHub)" <gi...@apache.org>.

vigyasharma commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-2045706297

   @benwtrent : ++, I've been thinking along similar lines in the context of e-commerce type applications where different vectors represent different aspects of a document. The scorer can do a weighted dot-product across different vectors.
   
   I like the wider generalization here, of using max/min/sum/... aggregations. Another thing to consider is that the multi-vector scorer will also need to handle similarity computations between nodes during graph build. For example, if the aggregation is `max`, would we need to compute the distance between `n x n` vectors and then take the max?
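
To make the `n x n` question concrete, here is a minimal sketch of a `max` aggregation between two multi-vector nodes. It is illustrative only: the class and method names are hypothetical and not Lucene APIs, and the dot product stands in for whatever similarity function the field would actually use.

```java
import java.util.List;

/** Sketch: "max" aggregation between two graph nodes that each hold several vectors. */
final class MultiVectorMaxSimilarity {

  /** Plain dot product; assumes both arrays have the same length. */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Scores every pair of vectors (n x m comparisons) and keeps the best score. */
  static float maxSimilarity(List<float[]> nodeA, List<float[]> nodeB) {
    float best = Float.NEGATIVE_INFINITY;
    for (float[] a : nodeA) {
      for (float[] b : nodeB) {
        best = Math.max(best, dot(a, b));
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<float[]> docA = List.of(new float[] {1f, 0f}, new float[] {0f, 1f});
    List<float[]> docB = List.of(new float[] {0.5f, 0.5f}, new float[] {0f, 2f});
    // Two vectors on each side means 2 x 2 = 4 dot products for one node comparison.
    System.out.println(maxSimilarity(docA, docB));
  }
}
```

The cost per node comparison grows with the product of the two vector counts, which is the graph-build overhead the question above is pointing at.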
   



[GitHub] [lucene] alessandrobenedetti commented on issue #12313: Multi-value Support for KnnVectorField

Posted by "alessandrobenedetti (via GitHub)" <gi...@apache.org>.

alessandrobenedetti commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-1612857327

   I'll have to spend more brain time on the proposed block-join alternative, but isn't it already "available" in that form? (with the consequent problems and benefits of joins?)
   When I have more time I'll give a more informed opinion!
   



[GitHub] [lucene] uschindler commented on issue #12313: Multi-value Support for KnnVectorField

Posted by "uschindler (via GitHub)" <gi...@apache.org>.

uschindler commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-1611685208

   I have a customer using Solr to do kNN for trademark images. Each trademark has several images, and they want to find the trademark with the closest image match (cosine distance). They currently use an outdated homemade plugin that scans through BinaryDocValues, but kNN looks like the right choice to speed up search. For faceting, however, they want hits counted on the trademarks and not on the images.



Re: [I] Multi-value Support for KnnVectorField [lucene]

Posted by "vigyasharma (via GitHub)" <gi...@apache.org>.

vigyasharma commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-2044076948

   What are some use-cases for multi-valued vectors that are not easily supported using parent-child block joins? 
   
   I'd like to contribute here, trying to understand our main requirements given we now have parent-child joins in knn. I suppose block joins require all child documents to be reindexed with each update. Is that the main overhead we'd like to avoid with multi-valued vectors? Are there query time scenarios that block joins cannot fulfill?



[GitHub] [lucene] benwtrent commented on issue #12313: Multi-value Support for KnnVectorField

Posted by "benwtrent (via GitHub)" <gi...@apache.org>.

benwtrent commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-1612016406

So, I have been thinking of the current implementation and was wondering if we could instead move towards using the `join` functionality?

Just to make sure I am not absolutely crazy.

- `join` already requires children and parent documents to be indexed next to each other (parent docs as the last doc in the child&parent block).
- When searching the graph, we would use a separate kind of `NeighborQueue` that requires topK parent documents (ParentNeighborQueue?). This queue would require a `BitSet` of the parent doc ids. Then when a child doc is visited via the graph, we can check its score and determine the parent block via `BitSet#nextSetBit`. Tracking how many parents we have visited and their scores wouldn't be too bad from that avenue.
- Top scoring documents would be ranked by the scores of the parent docs. I guess this COULD be flexible, allowing the `mean`, `min`, or `max` of the children discovered during the graph exploration.
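
The bookkeeping described in these bullets can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses `java.util.BitSet` as a stand-in for the parent bit set, the class and method names are hypothetical, and it aggregates with `max` over an already-collected list of visited children rather than hooking into the graph search itself.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: turn visited child hits into a topK list of parent documents. */
final class ParentTopKSketch {

  /**
   * @param parents bit set with one bit per parent doc id (parent is the last doc of its block)
   * @param childDocs doc ids of the children visited during the search
   * @param childScores score of each visited child, parallel to childDocs
   * @return up to k parent doc ids, best score first (ties broken by doc id)
   */
  static int[] topKParents(BitSet parents, int[] childDocs, float[] childScores, int k) {
    Map<Integer, Float> bestByParent = new HashMap<>();
    for (int i = 0; i < childDocs.length; i++) {
      // Children precede their parent, so the parent is the next set bit.
      int parent = parents.nextSetBit(childDocs[i]);
      if (parent < 0) {
        continue; // java.util.BitSet returns -1 when no parent follows this doc
      }
      // "max" aggregation: keep the best child score seen for this parent.
      bestByParent.merge(parent, childScores[i], Float::max);
    }
    List<Map.Entry<Integer, Float>> ranked = new ArrayList<>(bestByParent.entrySet());
    ranked.sort((a, b) -> {
      int byScore = Float.compare(b.getValue(), a.getValue());
      return byScore != 0 ? byScore : Integer.compare(a.getKey(), b.getKey());
    });
    int n = Math.min(k, ranked.size());
    int[] top = new int[n];
    for (int i = 0; i < n; i++) {
      top[i] = ranked.get(i).getKey();
    }
    return top;
  }

  public static void main(String[] args) {
    BitSet parents = new BitSet();
    parents.set(3); // docs 0..2 are children of doc 3
    parents.set(7); // docs 4..6 are children of doc 7
    int[] top = topKParents(parents, new int[] {0, 1, 2, 5},
        new float[] {0.9f, 0.8f, 0.7f, 0.6f}, 2);
    System.out.println(java.util.Arrays.toString(top));
  }
}
```

Swapping `Float::max` for a running sum and count would give the `mean` variant; `min` only needs `Float::min`.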


Re: [I] Multi-value Support for KnnVectorField [lucene]

Posted by "benwtrent (via GitHub)" <gi...@apache.org>.

benwtrent commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-2045604262

   I do think things like `ColBERT` would benefit from having multiple vectors for a single document field.
   
   One crazy idea I had (others have probably already thought of this, and found it wanting...) is: since HNSW already supports non-Euclidean space, what if HNSW graph nodes simply represented more than one vector?
   
   Then the flat storage system and underlying scorer could handle the distance computations and HNSW itself doesn't actually have to change. 
   
   I could see this maybe hurting recall, but I wonder how badly it would actually hurt things in practice.
   
   The idea would be:
   
    - A new FlatVectorFormat type that allows more than one vector (or possibly extending the existing ones)
    - That type would provide a scorer to HNSW that resolves the multi-vector scores by providing a particular aggregation of the scores of the vectors. This could be "max", "min", "avg", "sum" or something.
    - Then we need to test the graph's recall for individual vectors, since a query could be one vector (regular passage search) or multiple (ColBERT).
   
   HNSW doesn't actually look at the vectors at all; it simply provides an ordinal and requests a score, so the code change wouldn't be too bad, I think.
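
As a rough sketch of the scorer idea above: the graph side only passes an ordinal and gets one number back, while the aggregation happens behind that interface. Everything here is hypothetical (the class name, the `Aggregation` enum, the in-memory `float[][][]` standing in for a flat vector store) and it assumes every node holds at least one vector.

```java
import java.util.function.IntToDoubleFunction;

/** Sketch: a scorer that hides multiple vectors per node behind "ordinal -> score". */
final class AggregatingScorerSketch {

  enum Aggregation { MAX, MIN, AVG, SUM }

  /** Plain dot product; assumes both arrays have the same length. */
  static double dot(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /**
   * @param query a single query vector
   * @param vectorsByOrdinal vectorsByOrdinal[ord] holds all vectors of node ord (non-empty)
   * @return a function mapping a node ordinal to one aggregated score
   */
  static IntToDoubleFunction scorer(
      float[] query, float[][][] vectorsByOrdinal, Aggregation agg) {
    return ordinal -> {
      float[][] vectors = vectorsByOrdinal[ordinal];
      double max = Double.NEGATIVE_INFINITY;
      double min = Double.POSITIVE_INFINITY;
      double sum = 0;
      for (float[] v : vectors) {
        double s = dot(query, v);
        max = Math.max(max, s);
        min = Math.min(min, s);
        sum += s;
      }
      switch (agg) {
        case MAX: return max;
        case MIN: return min;
        case AVG: return sum / vectors.length;
        default: return sum; // SUM
      }
    };
  }

  public static void main(String[] args) {
    float[][][] nodes = {
      {{1f, 0f}, {0f, 1f}}, // node 0 holds two vectors
      {{2f, 0f}}            // node 1 holds one vector
    };
    IntToDoubleFunction maxScorer = scorer(new float[] {1f, 0f}, nodes, Aggregation.MAX);
    System.out.println(maxScorer.applyAsDouble(0) + " " + maxScorer.applyAsDouble(1));
  }
}
```

A multi-vector query (the ColBERT case) would need a second level of aggregation across query vectors, which this sketch leaves out.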



Re: [I] Multi-value Support for KnnVectorField [lucene]

Posted by "alessandrobenedetti (via GitHub)" <gi...@apache.org>.

alessandrobenedetti commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-1831534483

   Hi @david-sitsky, the multi-valued vectors contribution to Lucene is currently paused for lack of funding.
   I'll resume it from my side if and when I get some sponsors :)
   
   The nested documents approach, on the other hand, has been released with Lucene 9.8! You can read the various considerations that apply in the thread!



Re: [I] Multi-value Support for KnnVectorField [lucene]

Posted by "benwtrent (via GitHub)" <gi...@apache.org>.

benwtrent commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-2045722371

   >  if the aggregation is max, would we need to compute distance between n x n vectors and then take the max?
   
   Correct, I would even want flexibility between what was used to build the graph vs. how we query it. 
   
   This may be a good enough trade-off against recall. I would hope that adding additional connections in the graph would offset any recall differences we would see when combining vectors in a single node vs. each vector being its own node. All this requires testing and prototyping; I just haven't had the time to dig deeply.



[GitHub] [lucene] uschindler commented on issue #12313: Multi-value Support for KnnVectorField

Posted by "uschindler (via GitHub)" <gi...@apache.org>.

uschindler commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-1612506692

I would still prefer to have multiple values per document. From the point of view of implementation this does not look crazy to me, but using block joins adds too many limitations, and often people don't want to use them for other reasons.

The implementation as suggested by @alessandrobenedetti looks great to me and goes in line with other multivalued fields in Lucene. Just my comments after watching his talk and skimming through the PR:
- the storage of vectors is basically similar to SortedSetDocValues (see also @msokolov's initial implementation, which solely used DocValues). The change here is from SortedDocValues to SortedSetDocValues. We may keep a separate single-valued implementation and offer a wrapper (like for doc values).
- the index to find nearest neighbours (HNSW) does not need any change because the graph entries just point to ordinal numbers. We just need to take care that the number of ordinals may go beyond Integer.MAX_VALUE.
- result collection is different because we need to apply the min/max/avg functions. To me this is the most complicated change, but this would be similarly complex with block join.

I think the biggest problem of the current PR is that ordinals need to be "long" as the number of vectors may go beyond Integer.MAX_VALUE.
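
A small sketch of why the ordinal width matters, and of one possible way to map a vector ordinal back to its document. The layout (a `long[]` of cumulative offsets searched with binary search) is an assumption made for illustration, not what the PR does, and all names are hypothetical.

```java
/** Sketch: long vector ordinals mapped back to int doc ids via cumulative offsets. */
final class OrdinalToDocSketch {

  /** offsets[d] is the ordinal of doc d's first vector; the last entry is the total count. */
  static long[] buildOffsets(int[] vectorsPerDoc) {
    long[] offsets = new long[vectorsPerDoc.length + 1];
    for (int d = 0; d < vectorsPerDoc.length; d++) {
      offsets[d + 1] = offsets[d] + vectorsPerDoc[d];
    }
    return offsets;
  }

  /** Finds the largest doc d with offsets[d] <= ordinal, which skips docs without vectors. */
  static int docForOrdinal(long[] offsets, long ordinal) {
    if (ordinal < 0 || ordinal >= offsets[offsets.length - 1]) {
      throw new IllegalArgumentException("ordinal out of range: " + ordinal);
    }
    int lo = 0;
    int hi = offsets.length - 2; // index of the last doc
    while (lo < hi) {
      int mid = (lo + hi + 1) >>> 1;
      if (offsets[mid] <= ordinal) {
        lo = mid;
      } else {
        hi = mid - 1;
      }
    }
    return lo;
  }

  public static void main(String[] args) {
    // 300 million docs with 8 vectors each already exceed the int range.
    long totalVectors = 300_000_000L * 8;
    System.out.println(totalVectors > Integer.MAX_VALUE); // true: 2.4e9 > 2^31 - 1

    long[] offsets = buildOffsets(new int[] {2, 0, 3}); // doc 1 has no vectors
    System.out.println(docForOrdinal(offsets, 2)); // third vector overall belongs to doc 2
  }
}
```

With single-valued fields the ordinal count is bounded by the doc count, so `int` suffices; once a document can contribute several vectors that bound no longer holds.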


Re: [I] Multi-value Support for KnnVectorField [lucene]

Posted by "alessandrobenedetti (via GitHub)" <gi...@apache.org>.

alessandrobenedetti commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-2044685997

   Hi, I gave a talk about this at Berlin Buzzwords where I touched on the motivations:
   https://www.youtube.com/watch?v=KhL0NrGj0uE
   In short:
   - multi-valued vectors will add feature parity with most of the other field types (multi-valued is supported for many field types)
   - nested docs bring various considerations in terms of both performance and the necessity of aggregating at different levels



[GitHub] [lucene] benwtrent commented on issue #12313: Multi-value Support for KnnVectorField

Posted by "benwtrent (via GitHub)" <gi...@apache.org>.

benwtrent commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-1611575334

   There are also late-interaction models that do embeddings per token. While the current HNSW codec wouldn't be best for that, it is another use case for multiple embeddings per document.



[GitHub] [lucene] benwtrent commented on issue #12313: Multi-value Support for KnnVectorField

Posted by "benwtrent (via GitHub)" <gi...@apache.org>.

benwtrent commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-1612949170

   > I'll have to spend more brain time on the proposed block-join alternative, but isn't it already "available" in that form? (with the consequent problems and benefits of joins?)
   
   The key issue is document collection. Right now, the `topK` is limited to only `topK` children documents. Really, what you want is the `topK` parent documents based on children scores.  
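
A tiny illustration of the collection problem, assuming child hits arrive sorted by descending score and each hit is tagged with its parent id (the class and method names are hypothetical): asking for the top 3 children can yield far fewer than 3 distinct parents.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Sketch: how many distinct parents survive when topK is applied to children. */
final class ChildTopKShortfall {

  /**
   * @param parentOfChildHit parent id of each child hit, best-scoring child first
   * @return the distinct parents found among the first k child hits, in hit order
   */
  static Set<Integer> parentsOfTopKChildren(int[] parentOfChildHit, int k) {
    Set<Integer> parents = new LinkedHashSet<>();
    int limit = Math.min(k, parentOfChildHit.length);
    for (int i = 0; i < limit; i++) {
      parents.add(parentOfChildHit[i]);
    }
    return parents;
  }

  public static void main(String[] args) {
    // The three best children all belong to parent 3; parents 7 and 9 come later.
    int[] parentOfChildHit = {3, 3, 3, 7, 9};
    System.out.println(parentsOfTopKChildren(parentOfChildHit, 3).size()); // 1 parent, not 3
  }
}
```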



Re: [I] Multi-value Support for KnnVectorField [lucene]

Posted by "david-sitsky (via GitHub)" <gi...@apache.org>.

david-sitsky commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-1831197772

   > The key issue is document collection. Right now, the `topK` is limited to only `topK` children documents. Really, what you want is the `topK` parent documents based on children scores.
   
   Just curious, has there been any progress with multi-value vector field support?
   
   Elasticsearch seems to support the idea of using the "block join" approach, as outlined in this blog post from a couple of weeks ago: https://www.elastic.co/search-labs/blog/articles/chunking-via-ingest-pipelines, although from what I can see, it will suffer from the same issue @benwtrent mentions, where the topK will be applied to child documents.

