You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/09/06 10:57:00 UTC

[jira] [Commented] (LUCENE-10040) Handle deletions in nearest vector search

    [ https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410531#comment-17410531 ] 

ASF subversion and git services commented on LUCENE-10040:
----------------------------------------------------------

Commit bc161e6dcc0d200c115992eda43cd7d06074184e in lucene's branch refs/heads/main from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=bc161e6 ]

LUCENE-10040 Correct TestHnswGraph.testSearchWithAcceptOrds (#277)

If we set numSeed = 10, this test fails sometimes  because it may mark
expected results docs (from 0 to 9) as deleted which don't end up
being retrieved, resulting in a low recall

- set numSeed to 10 to ensure 10 results are returned
- add startIndex paramenter to createRandomAcceptOrds that allows
  documents before startIndex to be NOT deleted
- use startIndex equal to 10 for createRandomAcceptOrds

Relates to #239

> Handle deletions in nearest vector search
> -----------------------------------------
>
>                 Key: LUCENE-10040
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10040
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Assignee: Julie Tibshirani
>            Priority: Major
>          Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> Currently nearest vector search doesn't account for deleted documents. Even if a document is not in {{LeafReader#getLiveDocs}}, it could still be returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be surprising + difficult for users, since other search APIs account for deleted docs. We've discussed extending the search logic to take a parameter like {{Bits liveDocs}}. This issue discusses options around adding support.
> One approach is to just filter out deleted docs after running the KNN search. This behavior seems hard to work with as a user: fewer than {{k}} docs might come back from your KNN search!
> Alternatively, {{LeafReader#searchNearestVectors}} could always return the {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs while assembling its candidate list. It would traverse further into the graph, visiting more nodes to ensure it gathers the required candidates. (Note deleted docs would still be visited/ traversed). The [hnswlib library|https://github.com/nmslib/hnswlib] contains an implementation like this, where you can mark documents as deleted and they're skipped during search.
> This approach seems reasonable to me, but there are some challenges:
>  * Performance can be unpredictable. If deletions are random, it shouldn't have a huge effect. But in the worst case, a segment could have 50% deleted docs, and they all happen to be near the query vector. HNSW would need to traverse through around half the entire graph to collect neighbors.
>  * As far as I know, there hasn't been academic research or any testing into how well this performs in terms of recall. I have a vague intuition it could be harder to achieve high recall as the algorithm traverses areas further from the "natural" entry points. The HNSW paper doesn't mention deletions/ filtering, and I haven't seen community benchmarks around it.
> Background links:
>  * Thoughts on deletions from the author of the HNSW paper: [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892]
>  * Blog from Vespa team which mentions combining KNN and search filters (very similar to applying deleted docs): [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. The "Exact vs Approximate" section shows good performance even when a large percentage of documents are filtered out. The team mentioned to me they didn't have the chance to measure recall, only latency.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org