You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/05/04 18:16:00 UTC
[jira] [Commented] (LUCENE-9848) Correctly sort HNSW graph neighbors when applying diversity criterion

    [ https://issues.apache.org/jira/browse/LUCENE-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531879#comment-17531879 ] 

ASF subversion and git services commented on LUCENE-9848:
---------------------------------------------------------

Commit dc6a7f94680deba7a5842d35526a966c96faec91 in lucene's branch refs/heads/main from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=dc6a7f94680 ]

LUCENE-9848 Sort HNSW graph neighbors for construction (#862)

* LUCENE-9848 Sort HNSW graph neighbors for construction

Sort HNSW graph neighbors when applying diversity criterion

During HNSW graph construction, when a node has already a number of
connections larger than maximum allowed (maxConn), we need to prune
its connections using a diversity criteria to limit the number of
connections to maxConn.

Currently when we add reverse connections to already existing nodes,
we don't keep them sorted. Thus later, when we apply diversity criteria
we may prune not the worst most distant non-diverse nodes.

This patch makes sure that neighbours connections are always sorted
from best (closest) to worst (distant), and during the application
of diversity criteria processes nodes from worst to best.

This path does the following:
- enhance NeighborArray to always keep neighbour nodes sorted according
  to their scores (in desc or asc order). Make NeighborArray aware in
  which order the nodes should be sorted.
- make OnHeapHnswGraph aware of the order of similarity function
- make HnswGraphBuilder apply diversity criteria from worst to
  best nodes
- create Lucene90NeighborArray to keep the previous logic of
  NeighborArray for Lucene90Codec

> Correctly sort HNSW graph neighbors when applying diversity criterion 
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-9848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9848
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Assignee: Mayya Sharipova
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When indexing new documents in an HNSW graph, we first find its nearest maxConn neighbors (using HNSW search), and then link the new document to this neighbors in the graph. These neighbors are filtered using a diversity test. The neighbors are added one by one, from most similar to least. Each new neighbor is checked against all prior (better) neighbors, and if it is more similar to that neighbor than it is to the target document, it is rejected as insufficiently diverse.
> When we applied this diversity criterion (rather than simply picking the k nearest neighbors), we saw substantial improvements in recall / latency ROC curves across several data sets, and it is part of the reference implementation, too (where we got it). I believe the impact on indexing performance was relatively small; this is a good thing to do, even though it is n^2 at its heart, the n remains reasonable due to being bounded by the maximum graph fanout parameter, {{maxConn}}. 
> Something funny happens when we reach the maximum fanout though. While a new document is being linked to its new neighbors, the neighbors are reciprocally linked to the new document, until their maximum fanout is reached. At that point, the diversity criterion is reapplied to select the neighbors to keep. Basically every neighbor is re-checked against every earlier (better) neighbor to verify the diversity criterion.  This is needed because we haven't really maintained the diversity property while adding these reciprocal links – the initial neighbors are checked for diversity, which often leads to fewer than {{maxConn}} of them being added. Then the new documents get linked in without checking, until {{maxConn}} is reached, and then diversity is checked again. This is kind of weird, but seems to work.
> But the really strange thing is that when we reject non-diverse documents (in HnswGraphBuilder.diversityUpdate), the neighbors are no longer sorted in nearness order. I did some rough checks to see if better graphs would result from re-sorting (so that when there are non-diverse neighbors, we always prefer to drop the worse-scoring one), but it didn't seem to matter all that much. But how can that be?
> At any rate this code is funky and hard to understand, and it would probably benefit from a second look to see if we can either improve indexing performance or improve search performance (by producing better graphs during indexing).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org