Posted to issues@lucene.apache.org by "Michael Sokolov (Jira)" <ji...@apache.org> on 2020/09/29 16:09:01 UTC
[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

    [ https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204091#comment-17204091 ] 

Michael Sokolov commented on LUCENE-9322:
-----------------------------------------

I posted a PR that builds on the discussion and earlier PRs from [~jtibshirani] and [~tomoko], and would appreciate your review if you have time. Just to address some of the recent discussion here:

1. This is for dense vectors only. I think handling sparse vectors is potentially interesting, but it would require a completely different approach, so I think it should be done separately.
2. I would like to see if we can completely hide the ANN implementation behind the vector API, as Julie initially proposed, making the selection of an algorithm a simple parameter of VectorValues. In the soon-to-come NSW graph implementation I have in mind, there is no new graph format, just another auxiliary index file inside the vector format. To that end, I included both L2 and dot-product distances, with the idea of maintaining something in the API that enables control over the underlying KNN implementation. E.g., could we have ScoreFunction overloaded with the graph algorithm? Maybe that's too much; I'd like feedback on this part.
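
To make the "simple parameter" idea concrete, here is a minimal sketch of what a score-function enum could look like. It is for illustration only: the constant names, the score() method, and its signature are assumptions and are not taken from the PR or from any existing Lucene API. Only the name ScoreFunction and the two distances (L2 and dot product) come from the comment above.

```java
/**
 * Hypothetical sketch of a score-function parameter for a vector field.
 * The constants and the score() method are assumptions for illustration,
 * not the API from the PR.
 */
enum ScoreFunction {
  /** Squared L2 (Euclidean) distance: lower values mean closer vectors. */
  EUCLIDEAN {
    @Override
    float score(float[] a, float[] b) {
      checkDimensions(a, b);
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        float diff = a[i] - b[i];
        sum += diff * diff;
      }
      return sum;
    }
  },

  /** Dot product: higher values mean more similar vectors. */
  DOT_PRODUCT {
    @Override
    float score(float[] a, float[] b) {
      checkDimensions(a, b);
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
      }
      return sum;
    }
  };

  /** Compares two dense vectors of equal dimension. */
  abstract float score(float[] a, float[] b);

  private static void checkDimensions(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "dimension mismatch: " + a.length + " != " + b.length);
    }
  }
}
```

With something of this shape, a caller would pick the distance by passing ScoreFunction.EUCLIDEAN or ScoreFunction.DOT_PRODUCT when defining the field. Whether the same parameter should also select the graph algorithm is exactly the open question above.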


> Discussing a unified vectors format API
> ---------------------------------------
>
>                 Key: LUCENE-9322
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9322
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Julie Tibshirani
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Two different approximate nearest neighbor approaches are currently being developed, one based on HNSW (LUCENE-9004) and another based on coarse quantization ([#LUCENE-9136]). Each prototype proposes to add a new format to handle vectors. In LUCENE-9136 we discussed the possibility of a unified API that could support both approaches. The two ANN strategies give different trade-offs in terms of speed, memory, and complexity, and it’s likely that we’ll want to support both. Vector search is also an active research area, and it would be great to be able to prototype and incorporate new approaches without introducing more formats.
> To me it seems like a good time to begin discussing a unified API. The prototype for coarse quantization ([https://github.com/apache/lucene-solr/pull/1314]) could be ready to commit soon (this depends on everyone's feedback of course). The approach is simple and shows solid search performance, as seen [here|https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326]. I think this API discussion is an important step in moving that implementation forward.
> The goals of the API would be
> # Support for storing and retrieving individual float vectors.
> # Support for approximate nearest neighbor search -- given a query vector, return the indexed vectors that are closest to it.


