Posted to users@solr.apache.org by Derek C <de...@hssl.ie> on 2023/03/15 22:59:27 UTC

KNN HNSW - How does "indexing" and "updating" work ?

Hi all,

This is still something I just don't understand and am very confused about:

With SOLR and KNN dense vector search :

How, or when, does SOLR update/refresh its HNSW table?  When
documents are added and deleted, for example, how and when does SOLR
"refresh" the KNN search?  I also don't understand how it works even when
adding the first initial documents with dense vector embeddings (but it
works, because we can execute KNN searches).  In this regard it seems very
different from the "normal" index.

Is the HNSW table fully in-memory ?  If a SOLR node is stopped and started
does it take time to do a rebuild of the HNSW in-memory table and will the
potential results from a newly created HNSW table differ from other nodes?
(I'm pretty sure we've seen that this is at least true when we've added new
nodes to a cluster, and those nodes sync the documents, and the new nodes
are returning different KNN results to the original nodes).

thanks for any info

Derek

--
Derek Conniffe
Harvey Software Systems Ltd T/A HSSL
Telephone (IRL): 086 856 3823
Telephone (US): (650) 449 6044
Skype: dconnrt
Email: derek@hssl.ie



Re: KNN HNSW - How does "indexing" and "updating" work ?

Posted by Kent Fitch <ke...@gmail.com>.

Hi Derek,

AFAIK, and based on my limited experience using Lucene's HNSW
implementation, the HNSW index works much the same as any other Lucene
index: its on-disk data structures are stored in a segment, alongside all
the other disk-based information about the documents it describes, and all
files are "write once".

From time to time (and forced by a commit), the on-disk representations of
new records will be written to a new segment.  Over time, segments will be
merged, replacing the merged segments with a new segment.  As with other
indices, the HNSW representation for the merged segments will be replaced
with a new HNSW graph representation which is written to the new segment.
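
To make the segment lifecycle concrete, here is a toy Python sketch of my
own. It is not Lucene code: the Segment class, build_segment() and merge()
are hypothetical names, and the "graph" is a brute-force nearest-neighbour
list rather than a real HNSW graph. It only illustrates the idea that a
segment is never modified after it is written, and that a merge writes a
fresh segment (with a freshly built graph) from the live documents of the
old ones. Real Lucene merges may be smarter than a rebuild from scratch.

```python
from dataclasses import dataclass
from math import dist
from typing import Dict, FrozenSet, List, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)  # frozen: a segment is "write once"
class Segment:
    name: str
    vectors: Tuple[Tuple[str, Vector], ...]          # (doc id, vector) pairs
    graph: Tuple[Tuple[str, Tuple[str, ...]], ...]   # (doc id, neighbour ids)


def build_segment(name: str, docs: Dict[str, Vector], m: int = 2) -> Segment:
    """Write a new segment; its neighbour graph is built once, at write time."""
    graph = []
    for doc_id, vec in docs.items():
        # Toy stand-in for HNSW: link each doc to its m closest neighbours.
        others = sorted((d for d in docs if d != doc_id),
                        key=lambda d: dist(vec, docs[d]))
        graph.append((doc_id, tuple(others[:m])))
    return Segment(name, tuple(docs.items()), tuple(graph))


def merge(name: str, segments: List[Segment],
          deleted: FrozenSet[str] = frozenset()) -> Segment:
    """Copy the live docs of the old segments into one new segment."""
    live = {doc_id: vec
            for seg in segments
            for doc_id, vec in seg.vectors
            if doc_id not in deleted}
    return build_segment(name, live)  # the graph is rebuilt, not patched
```

For example, merging a segment holding docs "a" and "b" with one holding
"c" and a deleted "d" yields a new segment with "a", "b" and "c", and the
neighbour list of "a" changes from just "b" to "b" and "c". The old
segments are left untouched until they are discarded.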

In this respect, I don't think HNSW indices are different from other
indices.

Perhaps what is noticeably different is that HNSW implements an approximate
nearest neighbour (ANN) capability, trading off recall for search time.
Hence, as segments are merged and their graphs replaced, you can see
different search results, particularly if you are using a lowish search API
"k" parameter (the search beamwidth) and even more so if you have a very
large index.  The index-build parameters "M" (nearest neighbour count) and
"efConstruction" (search beamwidth used to find nearest neighbours) interact
with the search beamwidth, and it is worth reviewing the technical
descriptions of HNSW to understand more about this.
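
One way to see how much the approximation matters on a given node is to
compare its KNN results against an exact brute-force ranking and compute
the recall. The Python sketch below is only an illustration under my own
assumptions: the helper names are hypothetical, the distance is plain
Euclidean (your field may use cosine or dot product), and the query string
follows the {!knn f=... topK=...} form that I understand the Solr reference
guide describes for dense vector search. As far as I know, "M" and
"efConstruction" correspond to the hnswMaxConnections and hnswBeamWidth
attributes of solr.DenseVectorField.

```python
from math import dist
from typing import Dict, Iterable, List, Sequence


def knn_query(field: str, vector: Sequence[float], top_k: int = 10) -> str:
    """Build a Solr KNN query string; a larger top_k widens the search."""
    vec = ", ".join(repr(float(x)) for x in vector)
    return f"{{!knn f={field} topK={top_k}}}[{vec}]"


def exact_top_k(query: Sequence[float],
                docs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Brute-force ground truth: ids of the k closest vectors (Euclidean)."""
    return sorted(docs, key=lambda d: dist(query, docs[d]))[:k]


def recall_at_k(approx_ids: Iterable[str], exact_ids: Iterable[str]) -> float:
    """Fraction of the true nearest neighbours that the ANN search returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)
```

Running the same query against two nodes and comparing each result list
with exact_top_k() shows whether the nodes merely differ from each other or
whether one of them has noticeably lower recall. Raising topK and keeping
only the first few hits is a simple way to trade query time for recall.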

The entire HNSW graph for all segments does not need to be in memory to
perform searching, but from my experience with large HNSW graphs, because
the approximate nearest neighbour search jumps around the graph, unless it
all fits in memory (the OS's file system cache), you'll see lots of IO and
much slower query rates, particularly for large graphs and large values of
the search API's "k" parameter.
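
To judge whether the vector data can plausibly fit in the file system
cache, a back-of-the-envelope estimate helps. The Python sketch below is a
rough approximation of my own, not a Lucene formula: it assumes 4-byte
float vectors, and it assumes up to 2 * M neighbour ids of 4 bytes each per
document on the bottom graph layer, ignoring the much smaller upper layers
and any on-disk compression. The function name and defaults are
hypothetical.

```python
def estimate_hnsw_bytes(num_vectors: int, dimension: int,
                        max_connections: int = 16,
                        bytes_per_value: int = 4) -> int:
    """Rough size of the raw vectors plus bottom-layer graph links, in bytes."""
    # Raw vectors: one value per dimension per document.
    vector_bytes = num_vectors * dimension * bytes_per_value
    # Assumption: up to 2 * M int32 neighbour ids per document on layer 0.
    graph_bytes = num_vectors * 2 * max_connections * 4
    return vector_bytes + graph_bytes
```

For one million 768-dimensional vectors with M = 16 this gives about 3.2 GB,
nearly all of it the vectors themselves, so the vector dimension and
document count dominate the amount of cache you need.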

I believe that the HNSW graph for a new segment (whether being created due
to merging or new records) does need to be entirely in memory, allocated in
the JVM heap: I have had the need to increase the JVM heap when some
extremely large HNSW segment files have been merged.

best regards,

Kent Fitch
