Posted to issues@lucene.apache.org by "Julie Tibshirani (Jira)" <ji...@apache.org> on 2022/06/16 00:19:00 UTC

[jira] [Commented] (LUCENE-10577) Quantize vector values

    [ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554832#comment-17554832 ] 

Julie Tibshirani commented on LUCENE-10577:
-------------------------------------------

[~sokolov] this is a really exciting effort. Sorry that it's taken so long to review your latest PR (https://github.com/apache/lucene/pull/947). The latency improvement and space savings look great.

One concern with the PR is that the vector encoding logic "leaks" into different places like HnswGraphBuilder and HnswGraphSearcher. These classes now have split logic in several places to handle the quantization. It also felt surprising to use VectorSimilarityFunction to indicate that the elements should be stored as 8-bit integers. I think of similarity as just defining the comparison metric, but VectorSimilarityFunction.DOT_PRODUCT8 actually lowers the precision of the stored vectors.

One idea I had was to move configuration for quantizing values to the HNSW format. The format would also be in charge of computing the similarities through its VectorValues implementation. Maybe there could be a new function like this:

{code}
VectorValues#similarity(float[] query)
{code}

It would be free to operate directly on the bytes representation of the vectors. The other classes like HnswGraphSearcher, KnnVectorQuery, etc. could delegate to this method to compute query-document similarities.
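
For concreteness, a rough sketch of how that could look on VectorValues (the name and shape here are placeholders, not a concrete API proposal):

{code}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch only: a quantizing format's implementation could compare the query
// directly against its stored byte representation, without ever materializing
// full-precision floats.
public abstract class VectorValues extends DocIdSetIterator {
  // ... existing accessors (dimension, size, vectorValue, ...) elided ...

  /** Score the current document's vector against the given query vector. */
  public abstract float similarity(float[] query) throws IOException;
}
{code}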

We discussed a similar idea in https://issues.apache.org/jira/browse/LUCENE-10191, where Robert suggested that VectorSimilarityFunction should be a config on the kNN vectors format, instead of being its own concept that lives with KnnVectorField. This refactor would move us in a similar direction. Let me know if I'm not making any sense and I can post a rough PR with the idea :)


> Quantize vector values
> ----------------------
>
>                 Key: LUCENE-10577
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10577
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to support that it is not really necessary to store vectors in full precision. Users may also be willing to retrieve values in lower precision, for whatever purpose those serve, if it lets them store more samples. We know that 8 bits is enough to very nearly match the recall/performance tradeoff achieved with full-precision vectors. I'd like to explore how we could enable 4:1 compression of these fields by reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide their data in reduced-precision format and give control over the quantization to them. It would have a major impact on the Lucene API surface though, essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would require no or perhaps very limited change to the existing API to enable the feature.
> I've been exploring (2), and I find that we can achieve very good recall using dot-product similarity scoring with simple linear scaling + quantization of the vector values, so long as we choose the scale that minimizes the quantization error. Dot-product is amenable to this treatment since vectors are required to be unit-length when used with that similarity function.
> Even so, there is variability in the ideal scale across different data sets. A good choice seems to be max(abs(min-value), abs(max-value)), but of course this assumes the data set doesn't have a few outlier points. A theoretical range can be derived from 1/sqrt(dimension), but this is only useful when the samples are normally distributed. We could in principle determine the ideal scale when flushing a segment and manage the quantization per segment, but then numerical error could creep in when merging.
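> To make that scaling concrete, it looks roughly like this per vector (an illustrative sketch only; the scale is assumed to be chosen up front using one of the heuristics above):
> {code}
> // Illustrative sketch: linear scale-and-quantize of a float vector to signed bytes.
> static byte[] quantize(float[] vector, float scale) {
>   byte[] quantized = new byte[vector.length];
>   for (int i = 0; i < vector.length; i++) {
>     // map [-scale, scale] onto [-127, 127], clamping any outliers
>     int q = Math.round(vector[i] * (127f / scale));
>     quantized[i] = (byte) Math.max(-127, Math.min(127, q));
>   }
>   return quantized;
> }
> {code}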
> I'll post a patch/PR with an experimental setup I've been using for evaluation purposes. It is pretty self-contained and simple, but has some drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant that I have been playing with)
> 2. Converts from byte back to float when computing the dot-product instead of computing directly on the byte values
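> Computing directly on the byte values would look something like this (sketch only; the integer result would then be rescaled into the float score range using the quantization scale):
> {code}
> // Sketch: dot product accumulated in an int directly over the quantized bytes,
> // avoiding the byte-to-float conversion.
> static int dotProduct8(byte[] a, byte[] b) {
>   int sum = 0;
>   for (int i = 0; i < a.length; i++) {
>     sum += a[i] * b[i];
>   }
>   return sum;
> }
> {code}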
> I'd like to get people's feedback on the approach and whether in general we should think about doing this compression under the hood, or expose a byte-oriented API. Whatever we do, I think a 4:1 compression ratio is pretty compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org