You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Alessandro Benedetti (Jira)" <ji...@apache.org> on 2022/06/23 13:35:00 UTC

[jira] [Comment Edited] (LUCENE-10593) VectorSimilarityFunction reverse removal

    [ https://issues.apache.org/jira/browse/LUCENE-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558093#comment-17558093 ] 

Alessandro Benedetti edited comment on LUCENE-10593 at 6/23/22 1:34 PM:
------------------------------------------------------------------------

Hi @msokolov @mayya-sharipova and @jtibshirani , I have finally finished my performance tests.
Initially the results were worse in this branch, I found that suspicious as I expected the removal of the BoundChecker and the removal of the reverse mechanism to outweigh the additional division in the distance measure during graph building and searching.

After a deep investigation I found the culprit (you see it in the latest commit).


{code:java}
if (neighborSimilarity >= score) {
if ((neighborSimilarity < score) == false) { // this version improves the performance dramatically in both indexing/searching
{code}


After that fix, the results are very encouraging.
There are strong speedup for both angular and euclidean distances, both for indexing and searching.
*If this is validated we are getting a great cleanup of the code and also a nice performance boost.*
I'll have my colleague @eliaporciani to repeat the tests on Apple M1.

The following tests were executed on Intellij running the org.apache.lucene.util.hnsw.KnnGraphTester.
2.4 GHz 8-Core Intel Core i9 - 32 GB 2667 MHz DDR4


{noformat}
`INDEXING EUCLIDEAN

-beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5 -metric euclidean

ORIGINAL
IW 0 [2022-06-22T14:00:12.647030Z; main]: 64335 msec to write vectors
IW 0 [2022-06-22T14:01:57.425108Z; main]: 65710 msec to write vectors
IW 0 [2022-06-22T14:03:18.052900Z; main]: 64817 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T14:04:50.683607Z; main]: 6597 msec to write vectors
IW 0 [2022-06-22T14:05:34.090801Z; main]: 6687 msec to write vectors
IW 0 [2022-06-22T14:06:00.268309Z; main]: 6564 msec to write vectors

INDEXING ANGULAR

-beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5 -metric angular

ORIGINAL
IW 0 [2022-06-22T13:55:45.401310Z; main]: 32897 msec to write vectors
IW 0 [2022-06-22T13:56:39.737642Z; main]: 33255 msec to write vectors
IW 0 [2022-06-22T13:57:31.172709Z; main]: 32576 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T13:52:06.085790Z; main]: 25261 msec to write vectors
IW 0 [2022-06-22T13:52:51.022766Z; main]: 25775 msec to write vectors
IW 0 [2022-06-22T13:53:47.565833Z; main]: 24523 msec to write vectors`

`SEARCH EUCLIDEAN

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5 -search /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5 -metric euclidean

ORIGINAL
completed 500 searches in 1026 ms: 487 QPS CPU time=1025ms
completed 500 searches in 1030 ms: 485 QPS CPU time=1029ms
completed 500 searches in 1031 ms: 484 QPS CPU time=1030ms

THIS BRANCH
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 47 ms: 10638 QPS CPU time=46ms

SEARCH ANGULAR

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5 -search /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5 -metric angular

ORIGINAL
completed 500 searches in 154 ms: 3246 QPS CPU time=153ms
completed 500 searches in 162 ms: 3086 QPS CPU time=162ms
completed 500 searches in 166 ms: 3012 QPS CPU time=166ms

THIS BRANCH
completed 500 searches in 62 ms: 8064 QPS CPU time=62ms
completed 500 searches in 65 ms: 7692 QPS CPU time=65ms
completed 500 searches in 63 ms: 7936 QPS CPU time=62ms
`
{noformat}



Please correct me in case I did anything wrong, it's the first time I was using the org.apache.lucene.util.hnsw.KnnGraphTester

I am proceeding in running additional performance tests on different datasets.
Functional tests are all green.


was (Author: alessandro.benedetti):
Hi @msokolov @mayya-sharipova and @jtibshirani , I have finally finished my performance tests.
Initially the results were worse in this branch, I found that suspicious as I expected the removal of the BoundChecker and the removal of the reverse mechanism to outweigh the additional division in the distance measure during graph building and searching.

After a deep investigation I found the culprit (you see it in the latest commit).


{code:java}
if (neighborSimilarity >= score) {
if ((neighborSimilarity < score) == false) { // this version improves the performance dramatically in both indexing/searching
{code}


After that fix, the results are very encouraging.
There are strong speedup for both angular and euclidean distances, both for indexing and searching.
*If this is validated we are getting a great cleanup of the code and also a nice performance boost.*
I'll have my colleague @eliaporciani to repeat the tests on Apple M1.

The following tests were executed on Intellij running the org.apache.lucene.util.hnsw.KnnGraphTester.
2.4 GHz 8-Core Intel Core i9 - 32 GB 2667 MHz DDR4


{noformat}
`INDEXING EUCLIDEAN

-beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5 -metric euclidean

ORIGINAL
IW 0 [2022-06-22T14:00:12.647030Z; main]: 64335 msec to write vectors
IW 0 [2022-06-22T14:01:57.425108Z; main]: 65710 msec to write vectors
IW 0 [2022-06-22T14:03:18.052900Z; main]: 64817 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T14:04:50.683607Z; main]: 6597 msec to write vectors
IW 0 [2022-06-22T14:05:34.090801Z; main]: 6687 msec to write vectors
IW 0 [2022-06-22T14:06:00.268309Z; main]: 6564 msec to write vectors

INDEXING ANGULAR

-beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5 -metric angular

ORIGINAL
IW 0 [2022-06-22T13:55:45.401310Z; main]: 32897 msec to write vectors
IW 0 [2022-06-22T13:56:39.737642Z; main]: 33255 msec to write vectors
IW 0 [2022-06-22T13:57:31.172709Z; main]: 32576 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T13:52:06.085790Z; main]: 25261 msec to write vectors
IW 0 [2022-06-22T13:52:51.022766Z; main]: 25775 msec to write vectors
IW 0 [2022-06-22T13:53:47.565833Z; main]: 24523 msec to write vectors`

`SEARCH EUCLIDEAN

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5 -search /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5 -metric euclidean

ORIGINAL
completed 500 searches in 1026 ms: 487 QPS CPU time=1025ms
completed 500 searches in 1030 ms: 485 QPS CPU time=1029ms
completed 500 searches in 1031 ms: 484 QPS CPU time=1030ms

THIS BRANCH
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 47 ms: 10638 QPS CPU time=46ms

SEARCH ANGULAR

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5 -search /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5 -metric angular

ORIGINAL
completed 500 searches in 154 ms: 3246 QPS CPU time=153ms
completed 500 searches in 162 ms: 3086 QPS CPU time=162ms
completed 500 searches in 166 ms: 3012 QPS CPU time=166ms

THIS BRANCH
completed 500 searches in 62 ms: 8064 QPS CPU time=62ms
completed 500 searches in 65 ms: 7692 QPS CPU time=65ms
completed 500 searches in 63 ms: 7936 QPS CPU time=62ms
`
{noformat}



Please correct me in case I did anything wrong, it's the first time I was using the org.apache.lucene.util.hnsw.KnnGraphTester

> VectorSimilarityFunction reverse removal
> ----------------------------------------
>
>                 Key: LUCENE-10593
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10593
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alessandro Benedetti
>            Priority: Major
>              Labels: vector-based-search
>
> org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN similarity behaves in an opposite way in comparison to the other similarities:
> A higher similarity score means higher distance, for this reason, has been marked with "reversed" and a function is present to map from the similarity to a score (where higher means closer, like in all other similarities.)
> Having this counterintuitive behavior with no apparent explanation I could find(please correct me if I am wrong) brings a lot of nasty side effects for the code readability, especially when combined with the NeighbourQueue that has a "reversed" itself.
> In addition, it complicates also the usage of the pattern:
> Result Queue -> MIN HEAP
> Candidate Queue -> MAX HEAP
> In HNSW searchers.
> The proposal in my Pull Request aims to:
> 1) the Euclidean similarity just returns the score, in line with the other similarities, with the formula currently used to move from distance to score
> 2) simplify the code, removing the bound checker that's not necessary anymore
> 3) refactor here and there to be in line with the simplification
> 4) refactor of NeighborQueue to clearly state when it's a MIN_HEAP or MAX_HEAP, now debugging is much easier and understanding the HNSW code is much more intuitive



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org