Posted to issues@lucene.apache.org by "Michael Sokolov (Jira)" <ji...@apache.org> on 2020/09/25 13:52:01 UTC

[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

    [ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202181#comment-17202181 ] 

Michael Sokolov commented on LUCENE-9004:
-----------------------------------------

[~jim.ferenczi] thanks for your insightful comments! 

I've been working on an implementation again; this is basically a stripped-down (single-layer, not hierarchical) version of [~tomoko]'s most recent patch. I started from that patch, but it has evolved quite a bit, so I think it should be shared as a separate thing. Here are some notable differences:

# The graph is built on heap only while flushing/merging; it is then written to disk and afterwards searched by reading it from a file, with minimal on-heap data structures.
# It has a simpler API that hides the HNSW graph reading/writing as an internal detail of the vectors format.
# Nodes in the graph are identified by dense ordinals rather than docids. This allows for direct indexing of the vectors during graph search and leads to some efficiency when reading/writing the graph since friend lists are always built in sorted order. Graph ordinals must be mapped to/from docids, but only at the API boundary.
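To illustrate point 3, here is a rough sketch of what the ordinal/docid mapping at the API boundary could look like. The class and method names here are hypothetical, not from the patch; the point is just that dense ordinals make ordinal-to-docid a plain array lookup, and docid-to-ordinal a binary search over a sorted array:

```java
import java.util.Arrays;

// Hypothetical sketch: ordinals are assigned in docid order as vectors are
// flushed, so the ordinal -> docid array is sorted ascending.
class OrdinalMap {
  private final int[] ordToDoc; // index = ordinal, value = docid

  OrdinalMap(int[] docIdsWithVectors) {
    this.ordToDoc = docIdsWithVectors; // assumed sorted ascending
  }

  // ordinal -> docid: direct array lookup
  int docId(int ordinal) {
    return ordToDoc[ordinal];
  }

  // docid -> ordinal: binary search; -1 if the doc has no vector
  int ordinal(int docId) {
    int ord = Arrays.binarySearch(ordToDoc, docId);
    return ord >= 0 ? ord : -1;
  }
}
```

The mapping only happens at the API boundary; everything inside the graph search works on ordinals.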

I tried to do as little as possible to get a viable NSW vector KNN search that we can build on/improve; it's not perfect, but builds on the clean APIs / new index format previously developed in this patch, does achieve reasonable performance in my tests, and I think can be a starting point. With that in mind, here are some big TODOs I left for future iterations:

# Doesn't yet handle sorted indexes when merging
# No handling of deleted docs while searching (they would be merged away, but docs deleted in the current segment are still included in search).
# Doesn't impose maximum fanout on the graph (it's always symmetric right now)
# Takes a simple-minded approach to manage the graph while indexing - per-segment graph, which is more costly to search than a single index-wide graph (or some hybrid) as pointed out above. I think we can find a way to have a graph span multiple segments, but leave this for later.
# Nothing done re: expanding the search to handle filtering (due to deletions and/or other query constraints)
# Some data structures can certainly be optimized (there are Map<Integer, Long> instances that would be better unboxed, etc.)

A few of those (deleted docs, sorted index handling) really need to be addressed right away, but I think this can at least be tested for performance/suitability of the API without them. Some of the other issues require exciting work that improves the state of the art. E.g., to expand the search after filtering, we should be able to make the search re-entrant by maintaining the visited-neighbors set. On the subject of more efficient merging, we may be able to merge two existing graphs by laying connections between them. Also, I think there is an opportunity to build the graph concurrently on multiple threads. This is a bit tricky and may sacrifice some recall depending on how it's done, but it could be a good trade-off for better indexing speed. But this is all in the future; to start with we need something basic that works.
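As a sketch of that re-entrant idea (simplified, hypothetical names; not the patch's code): a best-first NSW search that takes its visited set from the caller could be resumed with a larger budget after filtering, since the set records which nodes have already been expanded:

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Set;

// Sketch of single-layer best-first NSW search with an externally-owned
// visited set. graph: adjacency lists indexed by ordinal; dist: precomputed
// distance of each node to the query (in a real search this is computed lazily).
class NswSearch {
  static int[] search(int[][] graph, float[] dist, int entryPoint, int k, Set<Integer> visited) {
    PriorityQueue<Integer> candidates = // min-heap: closest candidate first
        new PriorityQueue<>(Comparator.comparingDouble(n -> dist[n]));
    PriorityQueue<Integer> results =    // max-heap of the k best so far
        new PriorityQueue<>(Comparator.comparingDouble((Integer n) -> dist[n]).reversed());
    candidates.add(entryPoint);
    visited.add(entryPoint);
    while (!candidates.isEmpty()) {
      int next = candidates.poll();
      // stop when the closest unexpanded candidate is worse than the worst result
      if (results.size() >= k && dist[next] > dist[results.peek()]) break;
      results.add(next);
      if (results.size() > k) results.poll();
      for (int friend : graph[next]) {
        if (visited.add(friend)) candidates.add(friend); // only expand unvisited nodes
      }
    }
    int[] out = new int[results.size()];
    for (int i = out.length - 1; i >= 0; i--) out[i] = results.poll();
    return out; // nearest first
  }
}
```

Because the caller owns the visited set, a follow-up call after filtering would not revisit already-expanded nodes; the candidate/result queues would also need to be carried over in a real implementation.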

So I'd like to make some progress here by getting something basic committed to "master" (aside: are we going to rename? I see GitHub, at least, is moving to "main") that we can all iterate on and improve.

The patch is kind of big (around 5K LOC) so I think it needs to be broken up in order to be committed safely. A lot of the change is boilerplate required to introduce a new index format, so I hope we can commit that first to get it out of the way. My plan is to break it into two main commits:

# Add the VectorValues format along the lines of LUCENE-9322. I plan to pick up the discussion there and propose a simple implementation. This would get us vector storage and retrieval, but no efficient KNN search implementation. This change includes a lot of plumbing that touches a lot of files, but so long as we agree that adding the format is a good idea, it should be fairly straightforward :)
# Add the NSW/KNN search implementation. I think we can do this without any significant public API changes. It will basically be an implementation of {{VectorValues.search(float[] target, int k, int fanout)}}, supported by a graph built when flushing the vector field.
# Make it better! We can have more smaller issues once something basic is in place.
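For reference, the contract of that search method can be illustrated with a brute-force exact version (ignoring fanout, which only affects the approximate graph traversal). This is a hypothetical baseline for testing recall, not the proposed implementation:

```java
import java.util.Arrays;
import java.util.Comparator;

// Exact KNN by scanning all vectors: the answer an approximate graph search
// should approach as fanout grows. Names here are illustrative only.
class BruteForceKnn {
  // returns the ordinals of the k vectors nearest to target, nearest first
  static int[] search(float[][] vectors, float[] target, int k) {
    Integer[] ords = new Integer[vectors.length];
    for (int i = 0; i < ords.length; i++) ords[i] = i;
    Arrays.sort(ords, Comparator.comparingDouble(o -> squaredDistance(vectors[o], target)));
    int[] out = new int[Math.min(k, ords.length)];
    for (int i = 0; i < out.length; i++) out[i] = ords[i];
    return out;
  }

  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

Recall measurements like the 95% figure quoted in the issue description compare the approximate results against exactly this kind of baseline.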

Small aside, [~jim.ferenczi] re cost of the graph:

> We also need to keep the nearest neighbor distances for each neighbor so the total cost is N*M*8

I think this is not so for the NSW algorithm; you keep only the docids, so the cost is N*M*4. Still, this doesn't change any of the scaling conclusions.
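A quick worked instance of that arithmetic (illustrative numbers, not measurements from the patch): with N = 1,000,000 vectors and M = 16 neighbors per node, docids-only storage is 64 MB versus 128 MB if a distance were kept per link.

```java
// N*M*4 (4-byte docids only) vs. N*M*8 (docid + stored distance per link)
class GraphCost {
  static long docIdOnlyBytes(long n, long m) { return n * m * 4; }
  static long withDistancesBytes(long n, long m) { return n * m * 8; }
}
```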

> Approximate nearest vector search
> ---------------------------------
>
>                 Key: LUCENE-9004
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9004
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael Sokolov
>            Priority: Major
>         Attachments: hnsw_layered_graph.png
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers have found an approach based on navigating a graph that partially encodes the nearest neighbor relation at multiple scales can provide accuracy > 95% (as compared to exact nearest neighbor calculations) at a reasonable cost. This issue will explore implementing HNSW (hierarchical navigable small-world) graphs for the purpose of approximate nearest vector search (often referred to as KNN or k-nearest-neighbor search).
> At a high level, the way this algorithm works is this. First assume you have a graph that has a partial encoding of the nearest neighbor relation, with some short and some long-distance links. If this graph is built in the right way (has the hierarchical navigable small world property), then you can efficiently traverse it to find nearest neighbors (approximately) in log N time, where N is the number of nodes in the graph. I believe this idea was pioneered in [1]. The great insight in that paper is that if you use the graph search algorithm to find the K nearest neighbors of a new document while indexing, and then link those neighbors (undirectedly, i.e. both ways) to the new document, then the graph that emerges will have the desired properties.
> The implementation I propose for Lucene is as follows. We need two new data structures to encode the vectors and the graph. We can encode vectors using a light wrapper around {{BinaryDocValues}} (we also want to encode the vector dimension and have efficient conversion from bytes to floats). For the graph we can use {{SortedNumericDocValues}} where the values we encode are the docids of the related documents. Encoding the interdocument relations using docids directly will make it relatively fast to traverse the graph since we won't need to look up through an id-field indirection. This choice limits us to building a graph-per-segment, since it would be impractical to maintain a global graph for the whole index in the face of segment merges. However, graph-per-segment is very natural at search time - we can traverse each segment's graph independently and merge results as we do today for term-based search.
> At index time, however, merging graphs is somewhat challenging. While indexing we build a graph incrementally, performing searches to construct links among neighbors. When merging segments we must construct a new graph containing elements of all the merged segments. Ideally we would somehow preserve the work done when building the initial graphs, but at least as a start I'd propose we construct a new graph from scratch when merging. The process is going to be limited, at least initially, to graphs that can fit in RAM since we require random access to the entire graph while constructing it: in order to add links bidirectionally we must continually update existing documents.
> I think we want to express this API to users as a single joint {{KnnGraphField}} abstraction that joins together the vectors and the graph as a single joint field type. Mostly it just looks like a vector-valued field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many nocommits, basic design is not really set, there is no Query implementation and no integration with IndexSearcher, but it does work by some measure using a standalone test class. I've tested with uniform random vectors and on my laptop indexed 10K documents in around 10 seconds and searched them at 95% recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I haven't made any attempt to use multithreaded search for this, but it is amenable to per-segment concurrency.
> [1] [https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164]
>  
> *UPDATES:*
>  * (1/12/2020) The up-to-date branch is: [https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2]
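The incremental construction idea described in the quoted issue text (search for the new node's K nearest neighbors, then link them both ways) can be sketched roughly like this. This is a simplified single-layer illustration with exact search standing in for the graph search, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified NSW construction: each added node is linked undirectedly to its
// k nearest existing nodes. The real algorithm uses the graph search itself
// to find the neighbors; exact search is substituted here for brevity.
class NswBuilder {
  final List<List<Integer>> neighbors = new ArrayList<>();
  final List<float[]> vectors = new ArrayList<>();
  final int k;

  NswBuilder(int k) { this.k = k; }

  int add(float[] vector) {
    int ord = vectors.size();
    vectors.add(vector);
    neighbors.add(new ArrayList<>());
    List<Integer> nearest = new ArrayList<>();
    for (int i = 0; i < ord; i++) nearest.add(i);
    nearest.sort(Comparator.comparingDouble(i -> distance(vectors.get(i), vector)));
    for (int i = 0; i < Math.min(k, nearest.size()); i++) {
      int friend = nearest.get(i);
      neighbors.get(ord).add(friend);   // link both ways ("undirectedly")
      neighbors.get(friend).add(ord);
    }
    return ord;
  }

  static double distance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return sum;
  }
}
```

Note this sketch imposes no maximum fanout, which matches one of the TODOs above: existing nodes accumulate back-links without bound.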



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
