Posted to users@solr.apache.org by Alessandro Benedetti <a....@sease.io> on 2023/03/06 21:55:32 UTC

Re: KNN HNSW - performance over time with document updates

Hi Derek,
if you plan to move to 9.1 I would recommend 9.1.1 where we fixed a couple
of bugs related to the vector-based search feature.

In terms of what happens when you push new documents with vectors, it's not
much different from what happens generally in Lucene/Solr: segments will be
merged, and the related HNSW data structures are rebuilt as part of those
merges.
Long story short, you should not need to worry about a full re-index all the
time.
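
If you ever want to nudge that along after a burst of updates (purely
optional - merges happen in the background anyway), a rough SolrJ sketch,
assuming a hypothetical "images" core, would look something like this:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.Http2SolrClient;

public class OptionalCompact {
    public static void main(String[] args) throws Exception {
        // "images" is a placeholder core name
        try (SolrClient client =
                new Http2SolrClient.Builder("http://localhost:8983/solr/images").build()) {
            // commit makes the newly pushed vectors searchable
            client.commit();
            // optimize() forces segments to be merged down; the merged segments'
            // HNSW graphs are rebuilt as part of the merge, no re-index involved.
            // Usually unnecessary - shown only to illustrate the point.
            client.optimize();
        }
    }
}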

In terms of playing with the advanced parameters, I would only touch them if
you are unhappy with the defaults.
They affect the way the graph is built and impact indexing time and query
time performance (and quality).
To understand them better you can read the original paper or this blog post
I like very much: https://www.pinecone.io/learn/hnsw/
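
In Solr these surface as the hnswMaxConnections and hnswBeamWidth attributes
on the DenseVectorField field type. As a rough, illustrative sketch of what
the two knobs configure at the Lucene level (assuming Lucene 9.5 class names;
this is not the exact Solr wiring):

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene95.Lucene95Codec;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.IndexWriterConfig;

public class HnswIndexConfig {
    // maxConn ~ hnswMaxConnections: neighbours kept per graph node.
    // beamWidth ~ hnswBeamWidth: size of the candidate queue while building.
    // 16 and 100 are the Lucene defaults.
    public static IndexWriterConfig withHnsw(int maxConn, int beamWidth) {
        IndexWriterConfig cfg = new IndexWriterConfig();
        cfg.setCodec(new Lucene95Codec() {
            @Override
            public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                return new Lucene95HnswVectorsFormat(maxConn, beamWidth);
            }
        });
        return cfg;
    }
}

Generally, bigger values give a better graph (and better recall) at the cost
of slower indexing and a larger index.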

Regarding the results looking funny: how are the vectors inferred from the
images?
If you change the K in top-K, is it any better?
There are many points where this sort of search can go wrong; it may be
Solr's fault or not :)
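
As a concrete example of playing with the topK knob from SolrJ (the field
name "vector", the core name and the 4-dimensional query vector below are
just placeholders - use your own field and a vector of the right dimension):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class KnnQueryExample {
    public static void main(String[] args) throws Exception {
        try (Http2SolrClient client =
                new Http2SolrClient.Builder("http://localhost:8983/solr/images").build()) {
            // {!knn} returns the topK nearest documents to the given query vector
            SolrQuery q = new SolrQuery("{!knn f=vector topK=50}[0.12, -0.03, 0.91, 0.44]");
            q.setFields("id", "score");
            QueryResponse rsp = client.query(q);
            rsp.getResults().forEach(doc ->
                System.out.println(doc.getFieldValue("id") + " " + doc.getFieldValue("score")));
        }
    }
}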

Cheers

--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Wed, 1 Mar 2023 at 00:08, Kent Fitch <ke...@gmail.com> wrote:

> Hi Derek,
>
> I'm not sure how your image embeddings were generated, but as you probably
> know, I think it is only by experiment in each case that you can determine
> how far you can reduce the dimensions and/or compress the encoded values
> of each dimension before too-detrimental effects on nearest-neighbour
> scoring occur.  But I'd hazard a guess that encoding 512 vector float
> values as 512 bytes, using 512 codebooks generated by k-means clustering
> on each dimension (or fewer codebooks if you're lucky - as I mentioned,
> ada-002 value distributions for our use-case meant just 2 codebooks were
> needed for its 1536 values when we tried that approach, before we moved on
> to PQ coding), would preserve almost all of the original embedding
> information and reduce your HNSW index size by 75%, at the cost of
> requiring a custom similarity class to use the codebooks.
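>
> Just to make the shape of that concrete, a rough, untested sketch of the
> per-dimension codebook idea (all class and variable names are made up, and
> the codebooks themselves would be learned offline with k-means):
>
> public class PerDimensionCodebook {
>     private final float[][] codebooks; // [dimension][256] centroid values
>
>     public PerDimensionCodebook(float[][] codebooks) {
>         this.codebooks = codebooks;
>     }
>
>     // encode: pick the nearest of the 256 centroids for each dimension
>     public byte[] encode(float[] vector) {
>         byte[] out = new byte[vector.length];
>         for (int d = 0; d < vector.length; d++) {
>             int best = 0;
>             float bestDist = Float.MAX_VALUE;
>             for (int c = 0; c < 256; c++) {
>                 float diff = vector[d] - codebooks[d][c];
>                 if (diff * diff < bestDist) { bestDist = diff * diff; best = c; }
>             }
>             out[d] = (byte) best;
>         }
>         return out;
>     }
>
>     // dot product between a stored (encoded) vector and a float query vector,
>     // decoding each byte back to its centroid value on the fly
>     public float dotProduct(byte[] encoded, float[] query) {
>         float sum = 0f;
>         for (int d = 0; d < encoded.length; d++) {
>             sum += codebooks[d][encoded[d] & 0xFF] * query[d];
>         }
>         return sum;
>     }
> }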
>
> For the index I mentioned (160m docs, ada-002 encoding with 1536 floats
> represented by 512 bytes using PQ coding to represent 3 floats as 1 byte),
> the HNSW index (.vex .vem .vec files) is about 87GB.  If all I am doing is
> knn queries and retrieving a document id from the result, the OS file cache
> readily caches everything on a 2018-era Intel i7-9800 (8 cores, 16 threads)
> with 128GB DDR - there is no IO after the initial cache population.  With a
> 4GB heap for Lucene, a search beamwidth ("k") of 3 and 16 search threads, a
> total sustained query rate of about 32 queries/sec is maintained.  Yes, it
> is CPU intensive, because those 512 bytes still get expanded to 1536 floats
> which need multiplying and summing, and over 24K "probes" are required on
> average to build the result set for each query.
>
> HNSW certainly issues smallish random probes across its index, and query
> rates (and CPU usage) decline rapidly if it doesn't fit in memory, even
> with 16 nvme lanes into the CPU.  If you can move some memory allocation
> from the SOLR JVM to the file cache, that may help.
>
> The only time I've needed a big JVM was when constructing the index:
> towards the end of the build, some segment merges required a lot of memory.
> I guess with multiple segment merges happening in the JVM, each dealing
> with multiple large in-memory representations of their incoming and
> outgoing segments' HNSW graphs, a lot of heap is required!
>
> best regards
>
> Kent Fitch
>
> On Wed, Mar 1, 2023 at 2:23 AM Derek C <de...@hssl.ie> wrote:
>
> > Hi Kent,
> >
> > That's very interesting.  We have been thinking about reducing
> > (down-scaling) our dense vectors from 512 to 64, perhaps using PCA.
> > We have about 2.5 million documents and we did some testing (with Apache
> > JMeter): after about 10 concurrent requests we start to have performance
> > problems (SOLR seems to stall until we reduce the load for a while), so
> > reduced embedding sizes may really help with this.
> >
> > Just out of curiosity - when you were testing with up to 160M documents
> > with 512-long embeddings, were you using a single massive computer?
> > I've found that performance is OK/usable with 64 Gbytes of RAM where
> > SOLR has 30 Gbytes and the O/S has the remainder, with the SOLR
> > collection/core being around 20 Gbytes, so within the amount the O/S can
> > cache for disk I/O.
> >
> > Derek
> >
> > On Mon, Feb 27, 2023 at 5:16 AM Kent Fitch <ke...@gmail.com> wrote:
> >
> > > Hi Derek,
> > >
> > > I have been trying a few settings with HNSW in Lucene/SOLR, and whilst
> > > my experiences may not be directly relevant to you, they may provide
> > > some background.
> > >
> > > My tests have been with an index of up to 160M records containing a
> > > 512-element byte embedding.  The original embeddings were of text
> > > articles (average length about 450 words) generated by openAI's ada-002
> > > as 1536 floats, then encoded as 512 bytes by encoding each group of 3
> > > floats as 1 byte using PQ encoding, following the method described here:
> > >
> > > https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf
> > > The motivation for PQ encoding is basically to reduce index size.  A
> > > first attempt at encoding the floats as bytes worked well (I tried to
> > > minimise error by analysing the distribution of float values across the
> > > 1536 dimensions, and noticed that all but 5 of the dimensions had a very
> > > narrow range for most embeddings, so using k-means clustering to find
> > > 256 values for those dimensions, and another 256 values for the 5
> > > "outlier" dimensions, yielded good results).  However, each vector still
> > > occupied 1536 bytes, and HNSW really needs these to be in memory, as
> > > otherwise the IOs to even the RAID 10 nvme devices connected to their
> > > own PCIE3 lanes will cause slow query rates.  So quantising 3 floats
> > > into 1 byte was attractive.  Again, I used k-means on each of the 512
> > > groups of 3 floats to get 256 "centroids" per group and minimise error.
> > > The downside of this approach is the need to define a custom similarity
> > > that reads, at initialisation, the 512 centroid tables (each with 256
> > > mappings to expand a byte code to 3 floating-point numbers representing
> > > a "centroid" point).
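> > >
> > > To make that concrete, a rough, untested sketch of the decode step such
> > > a custom similarity needs (names are made up; the 512 codebooks - one
> > > per group of 3 floats, each with 256 centroids - are loaded at start-up):
> > >
> > > public class PqSimilaritySketch {
> > >     private final float[][][] codebooks; // [512][256][3] centroid values
> > >
> > >     public PqSimilaritySketch(float[][][] codebooks) {
> > >         this.codebooks = codebooks;
> > >     }
> > >
> > >     // encoded.length == 512, query.length == 1536: each stored byte is
> > >     // expanded back to the 3 floats of its centroid for the dot product
> > >     public float dotProduct(byte[] encoded, float[] query) {
> > >         float sum = 0f;
> > >         for (int group = 0; group < encoded.length; group++) {
> > >             float[] centroid = codebooks[group][encoded[group] & 0xFF];
> > >             int base = group * 3;
> > >             sum += centroid[0] * query[base]
> > >                  + centroid[1] * query[base + 1]
> > >                  + centroid[2] * query[base + 2];
> > >         }
> > >         return sum;
> > >     }
> > > }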
> > >
> > > Anyway, the loss caused by this mapping is real but not particularly
> > > consequential: some result lists are slightly degraded/reordered, but
> > > HNSW is an "approximate nearest neighbour" search anyway.
> > >
> > > How sure are you that the unexpected search results you are reporting
> > > are caused by the HNSW ANN rather than the encoding?  For example, if
> > > you run an exhaustive search on your 2m records to find the "real"
> > > nearest neighbours to some point representing some base document, how
> > > do the results differ from your HNSW search with various search
> > > beamwidths (provided as the "k" parameter on the KnnByteVectorQuery
> > > constructor)?
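> > >
> > > If it helps, a rough, untested sketch of the HNSW side of that
> > > comparison at the Lucene level (the exhaustive baseline would need a
> > > full scan of the stored vectors; field name, index path and query
> > > vector below are placeholders):
> > >
> > > import org.apache.lucene.index.DirectoryReader;
> > > import org.apache.lucene.index.IndexReader;
> > > import org.apache.lucene.search.IndexSearcher;
> > > import org.apache.lucene.search.KnnByteVectorQuery;
> > > import org.apache.lucene.search.ScoreDoc;
> > > import org.apache.lucene.store.FSDirectory;
> > > import java.nio.file.Paths;
> > >
> > > public class BeamWidthProbe {
> > >     public static void main(String[] args) throws Exception {
> > >         byte[] queryVector = new byte[512]; // fill with a real encoded vector
> > >         try (IndexReader reader =
> > >                 DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
> > >             IndexSearcher searcher = new IndexSearcher(reader);
> > >             // re-run the same query with increasing "k" and watch how the
> > >             // top hit (and its score) changes
> > >             for (int k : new int[] {1, 3, 10, 50, 120}) {
> > >                 ScoreDoc top = searcher
> > >                     .search(new KnnByteVectorQuery("vector", queryVector, k), 1)
> > >                     .scoreDocs[0];
> > >                 System.out.println("k=" + k + " -> doc " + top.doc + " score " + top.score);
> > >             }
> > >         }
> > >     }
> > > }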
> > >
> > > Although not directly relevant to your use-case, these are the results
> > > I'm seeing on an index of 160M documents with an ada-002 embedding
> > > quantised to 512 bytes, using a recent (11Feb23) Lucene built with an
> > > "M" of 64 and a construction "beamwidth" of 120, and with a custom
> > > similarity:
> > >
> > > with a search "k" of 1, the "real" closest match is returned 56% of the
> > > time and requires 18K similarity comparisons
> > > with a search "k" of 2, the "real" closest match is returned as the top
> > > match 61% of the time and requires 22K comparisons
> > > with "k" of 3, 64%, 24K comparisons
> > > "k" of 5, 70%, 29K
> > > "k" of 10, 78%, 37K
> > > "k" of 20, 87%, 48K
> > > "k" of 50, 94%, 63K
> > > "k" of 120, 97%, 121K
> > >
> > > The nature of the embeddings I loaded is that many are very similar
> > > (basically, randomish variations on a much smaller set of "base"
> > > articles, as we couldn't afford to get embeddings for 160M articles for
> > > this test - we are just trying to test whether Lucene's HNSW is feasible
> > > for our use-case), so in the overwhelming majority of "misses" the top
> > > article is indeed very similar to the article sought.  That is, for our
> > > use case the results are satisfactory, even with the "down-scaling" of
> > > the embedding to 512 bytes.
> > >
> > > best regards
> > >
> > > Kent Fitch
> > >
> > >
> > >
> > > On Mon, Feb 27, 2023 at 5:02 AM Derek C <de...@hssl.ie> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'm a bit uncertain how KNN with HNSW works in SOLR with dense vector
> > > > fields and searching.
> > > >
> > > > Recently I've been doing tests loading dense vectors after inferencing
> > > > [images] and then checking by eye the closest matches, and the results
> > > > look funny (very similar images not being the nearest results as I'd
> > > > normally expect).
> > > >
> > > > I'm unclear about HNSW in general (like what are the best policies, or
> > > > a good guide or starting point, for choosing hnswMaxConnections and
> > > > hnswBeamWidth values if you know the dense vector length (512) and you
> > > > know you have 2 million+ documents).
> > > >
> > > > But one thing I'm wondering right now is, with a dataset where
> > > > documents have been added and removed over time, can this affect the
> > > > KNN search (i.e. is it better if all documents, or at least the dense
> > > > vector field, had been indexed fresh)?
> > > >
> > > > BTW I haven't yet moved from SOLR 9.0 to 9.1, but I do read that the
> > > > HNSW codec has changed in some way so a reindex is required - I should
> > > > probably try 9.1 (I would prioritise this if anyone says 9.1 is better
> > > > quality or better performance for KNN searches!).
> > > >
> > > > Thanks for any info!
> > > >
> > > > Derek
> > > >
> > > > --
> > > > Derek Conniffe
> > > > Harvey Software Systems Ltd T/A HSSL
> > > > Telephone (IRL): 086 856 3823
> > > > Telephone (US): (650) 449 6044
> > > > Skype: dconnrt
> > > > Email: derek@hssl.ie
> > > >
> > > >
> > > >
> > >
> >
> >
> > --
> > --
> > Derek Conniffe
> > Harvey Software Systems Ltd T/A HSSL
> > Telephone (IRL): 086 856 3823
> > Telephone (US): (650) 449 6044
> > Skype: dconnrt
> > Email: derek@hssl.ie
> >
> >
> > *Disclaimer:* This email and any files transmitted with it are confidential
> > and intended solely for the use of the individual or entity to whom they
> > are addressed. If you have received this email in error please delete it
> > (if you are not the intended recipient you are notified that disclosing,
> > copying, distributing or taking any action in reliance on the contents of
> > this information is strictly prohibited).
> > *Warning*: Although HSSL have taken reasonable precautions to ensure no
> > viruses are present in this email, HSSL cannot accept responsibility for
> > any loss or damage arising from the use of this email or attachments.
> > P For the Environment, please only print this email if necessary.
> >
>