Posted to users@solr.apache.org by Derek C <de...@hssl.ie> on 2023/02/28 15:43:09 UTC

How/When does SOLR build KNN HNSW index ?

Hi all,

I have 5 copies/replicas of a SOLR collection with SOLR Cloud & Zookeeper
(SOLR 9.0)

I'm trying to get to the bottom of a strange issue: documents that are
updated via a single SOLR server/endpoint are replicated across all
replicas in the cluster (as you'd expect), and this includes a dense vector
field. Strangely, though, the KNN dense-vector results from each
server/replica differ (different topK results when the same query is run
against each replica/server). Does this make sense?


I don't really understand the "beamwidth" setting documented in the docs,
so I've left hnswBeamWidth at the default (40, I think, for SOLR 9.0).
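For reference, the HNSW parameters live on the field type in the managed schema. A hedged sketch of a Solr 9 DenseVectorField definition (the field and type names here are made up, and the parameter values are only illustrative; hnswBeamWidth and hnswMaxConnections are the two construction-time knobs):

```xml
<!-- Hypothetical field type; vectorDimension must match your embedding model -->
<fieldType name="knn_vector" class="solr.DenseVectorField"
           vectorDimension="768" similarityFunction="cosine"
           hnswMaxConnections="16" hnswBeamWidth="100"/>
<field name="doc_vector" type="knn_vector" indexed="true" stored="true"/>
```

A query against such a field would then use the knn query parser, e.g. `q={!knn f=doc_vector topK=10}[0.12, 0.43, ...]`.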

I'm also wondering: does SOLR KNN search index static values at the
point of indexing (like nearest-neighbour "friends")? And if data changes
over time (documents removed, documents added), will this affect KNN
performance?

thanks for any info

Derek

--
Derek Conniffe
Harvey Software Systems Ltd T/A HSSL
Telephone (IRL): 086 856 3823
Telephone (US): (650) 449 6044
Skype: dconnrt
Email: derek@hssl.ie


*Disclaimer:* This email and any files transmitted with it are confidential
and intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please delete it
(if you are not the intended recipient you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents of
this information is strictly prohibited).
*Warning*: Although HSSL have taken reasonable precautions to ensure no
viruses are present in this email, HSSL cannot accept responsibility for
any loss or damage arising from the use of this email or attachments.
For the Environment, please only print this email if necessary.

Re: How/When does SOLR build KNN HNSW index ?

Posted by Kent Fitch <ke...@gmail.com>.
Hi Derek,

On Wed, Mar 1, 2023 at 2:43 AM Derek C <de...@hssl.ie> wrote:

> Hi all,
>
> I have 5 copies/replicas of a SOLR collection with SOLR Cloud & Zookeeper
> (SOLR 9.0)
>
> I'm trying to get to the bottom of a strange issue: documents that are
> updated via a single SOLR server/endpoint are replicated across all
> replicas in the cluster (as you'd expect), and this includes a dense vector
> field. Strangely, though, the KNN dense-vector results from each
> server/replica differ (different topK results when the same query is run
> against each replica/server). Does this make sense?
>
>

A different ordering of documents being presented to HNSW can result in a
different graph and hence different results, although HNSW works hard to
mitigate the "different results" part of this. All things being equal, a
larger beamwidth is likely to reduce these differences, because a
graph-generation process run with a larger beamwidth searches more
thoroughly for nearest neighbours, including diverse nearest neighbours.
Hence a larger beamwidth tends towards "better" sets of neighbours and, in
accordance with the Anna Karenina principle, good sets of neighbours are
more alike, whereas not-so-good sets of neighbours are not-so-good in their
own ways. The replicas may also make slightly different segmenting and
segment-merging decisions, again producing different HNSW graphs.
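The order-sensitivity is easy to see even with a much simpler incremental graph than HNSW. A toy sketch (not HNSW itself, just an illustration of the principle): each newly inserted point links to its nearest already-inserted point, so different insertion orders produce different edge sets:

```python
import math

def build_graph(points, order, m=1):
    """Toy incremental neighbour graph (NOT real HNSW): each newly
    inserted point links to its m nearest already-inserted points."""
    edges = {}
    inserted = []
    for idx in order:
        by_dist = sorted(inserted, key=lambda j: math.dist(points[idx], points[j]))
        edges[idx] = set(by_dist[:m])
        inserted.append(idx)
    return edges

points = [(0, 0), (1, 0), (2, 0), (3, 0)]
g1 = build_graph(points, [0, 1, 2, 3])  # {0: set(), 1: {0}, 2: {1}, 3: {2}}
g2 = build_graph(points, [3, 2, 1, 0])  # {3: set(), 2: {3}, 1: {2}, 0: {1}}
```

Same points, same parameters, different insertion order, different graph; real HNSW's search then traverses whichever graph it got, so the approximate results can differ too.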

Because searching an HNSW graph returns "approximate nearest neighbours",
what is actually returned depends on a combination of insertion order, the
construction-time "M" and beamwidth settings, segmentation events (and, I
guess, sharding config in the case of SOLR), and the search-time beamwidth
setting ("k" in the knn search api).


> I don't really understand the "beamwidth" setting documented in the docs,
> so I've left hnswBeamWidth at the default (40, I think, for SOLR 9.0).
>
>

It is worth reviewing this paper, https://arxiv.org/pdf/1603.09320.pdf, and
its discussion of the "efConstruction" parameter, which is the "beamwidth"
parameter used during Lucene HNSW index construction. Basically, it serves
the same function as the "k" parameter in the knn search api, except that
during index construction it is the "beamwidth" used to find where to
insert the new node in the graph and which nearest neighbours to link it
with. Note that HNSW has a clever heuristic to increase the "diversity" of
nearest neighbours by not storing nearest neighbours that aren't very
interesting/useful.
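As a rough sketch of what that beamwidth does, here is a simplified single-layer version of the candidate search (ignoring HNSW's layers and the diversity heuristic; all names here are made up): the search keeps an `ef`-sized set of best results and keeps expanding the closest unexpanded candidate until no candidate can improve that set.

```python
import heapq
import math

def beam_search(graph, points, query, entry, ef):
    """Greedy best-first search keeping an ef-sized result set -- the role
    efConstruction/beamWidth plays when locating a new node's neighbours."""
    dist = lambda i: math.dist(points[i], query)
    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap: closest candidate first
    results = [(-dist(entry), entry)]     # max-heap (negated): worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0] and len(results) >= ef:
            break                         # nearest candidate can't improve the result set
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            nd = dist(nb)
            if len(results) < ef or nd < -results[0][0]:
                heapq.heappush(candidates, (nd, nb))
                heapq.heappush(results, (-nd, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the worst kept result
    return sorted((-d, n) for d, n in results)

# Chain graph 0-1-2-3-4-5 with points on a line; search from node 0 for (5, 0)
points = [(i, 0) for i in range(6)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
print(beam_search(graph, points, (5, 0), entry=0, ef=2))  # [(0.0, 5), (1.0, 4)]
```

A larger `ef` keeps more candidates alive, so the search explores more of the graph before giving up, which is why a larger construction beamwidth finds better (and more consistent) neighbour sets at the cost of build time.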

I don't know of any reliable formula for setting M and beamwidth, other
than generalisations such as: the larger they are, the more likely a given
search beamwidth will find higher-quality results, and, perhaps, the larger
the corpus, the larger they should be. There are tradeoffs: a large
beamwidth means longer construction times, and a large M means a "bigger
graph", although in my recent experience of the latest Lucene builds,
nearest neighbours seem to be encoded extremely efficiently in their disk
representations at least, and M doesn't change index size much at all.
The paper above has some pertinent graphs on this subject, but different
use-cases, especially the clustering characteristics of the vectors, may
complicate simple generalisations; I'm not sure. I suspect the best
approach is to try various combinations with representative data and
volumes, with good tools to measure the accuracy of recall (such as many
sets of "real" nearest neighbours for vectors you can query).
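A minimal recall measurement along those lines (assuming you can run the same queries as an exact brute-force search to get ground truth; the function names here are made up):

```python
import math

def brute_force_topk(points, query, k):
    """Exact nearest neighbours -- the ground truth to score against."""
    return sorted(range(len(points)), key=lambda i: math.dist(points[i], query))[:k]

def recall_at_k(approx_ids, true_ids, k):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(true_ids[:k])) / k

points = [(0.0,), (1.0,), (2.0,), (3.0,)]
truth = brute_force_topk(points, (0.1,), k=2)  # [0, 1]
print(recall_at_k([0, 2], truth, k=2))         # 0.5: found 0 but missed 1
```

Averaging recall@k over many representative queries, for each candidate M/beamwidth combination, gives a concrete number to trade off against build time and index size.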



> I'm also wondering:  Does SOLR KNN search mean it indexes static values at
> the
> point of indexing (like nearest neighbour friends?) and if data changes
> over
> time (documents removed, documents added) will this affect the KNN
> performance?
>

As with data in all Lucene segments, the HNSW graph structures are "write
once". So new documents are never added directly to an already-written
graph (but will probably be merged in as the segment they were written to
is eventually merged while the index grows). Each HNSW segment is searched
independently, and the results from all segments are combined and ranked by
score. I guess deleted documents will degrade results until they are
removed when the segment containing them is merged, but I have no
knowledge/experience of how results degrade based on the percentage of
deleted documents and the "M" and beamwidth (construction and search)
settings.
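A hedged sketch of that combine step (not Lucene's actual code, just the shape of it): each segment's HNSW graph is searched on its own, and the score-sorted per-segment hit lists are merged into one global top-k.

```python
import heapq
import itertools

def merge_segment_results(segment_hits, k):
    """Merge per-segment hit lists (each already sorted by descending
    score) into a single global top-k, the way results from independently
    searched per-segment HNSW graphs get combined."""
    merged = heapq.merge(*segment_hits, key=lambda hit: -hit[0])
    return list(itertools.islice(merged, k))

seg1 = [(0.9, "a"), (0.5, "b")]  # hits from segment 1: (score, doc_id)
seg2 = [(0.8, "c"), (0.7, "d")]  # hits from segment 2
print(merge_segment_results([seg1, seg2], k=3))  # [(0.9, 'a'), (0.8, 'c'), (0.7, 'd')]
```

One consequence of this shape is that more (smaller) segments mean more independent graph searches per query, which is part of why merge policy and segment count can affect both KNN latency and results.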

best regards

Kent Fitch