You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Julie Tibshirani (Jira)" <ji...@apache.org> on 2021/04/27 20:46:00 UTC

[jira] [Updated] (LUCENE-9941) ann-benchmarks results for HNSW indexing

     [ https://issues.apache.org/jira/browse/LUCENE-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julie Tibshirani updated LUCENE-9941:
-------------------------------------
    Description: 
This is a continuation of LUCENE-9937, but for HNSW index performance.

Approaches
 * LuceneVectorsOnly: a baseline that only indexes vectors
 * LuceneHnsw: our HNSW implementation, with a force merge to one segment
 * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
 * hnswlib: a C++ HNSW implementation from the author of the paper

Datasets
 * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, comparing euclidean distance
 * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, comparing cosine similarity

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                 Index time (sec)
LuceneVectorsOnly              14.93
LuceneHnsw                   3191.16
LuceneHnswNoForceMerge       1194.31
hnswlib                       311.09
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                  Index time (sec)
LuceneVectorsOnly              14.17
LuceneHnsw                   8940.41
LuceneHnswNoForceMerge       3623.68 
hnswlib                       587.23
{code}
We force merge to one segment to emulate a case where vectors aren't continually being indexed. In these situations, it seems likely users would force merge to optimize search speed: searching a single large graph is expected to be faster than searching several small ones serially. To see how long the force merge takes, we can subtract LuceneHnswNoForceMerge from LuceneHnsw.

The construction parameters match those in LUCENE-9937 and are optimized for search recall + QPS instead of index speed, as I figured this would be a common set-up.

Some observations:
 * In cases when segments are eventually force merged, we do a lot of extra work building intermediate graphs that are eventually merged away. This is a difficult problem, and one that's been raised in the past. As a simple step, I wonder if we should not build graphs for segments that are below a certain size. For sufficiently small segments, it could be a better trade-off to avoid building a graph and support nearest-neighbor search through a brute-force scan?
 * Indexing is slow compared to what we're used to for other formats, even if we disregard the extra work mentioned above. For sift-128-euclidean, building only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 min.
 * As a note, graph indexing uses ANN searches in order to add each new vector to the graph. So the slower search speed between Lucene and hnswlib may contribute to slower indexing.

  was:
This is a continuation of LUCENE-9937, but for HNSW index performance.

Approaches
 * LuceneVectorsOnly: a baseline that only indexes vectors
 * LuceneHnsw: our HNSW implementation, with a force merge to one segment
 * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
 * hnswlib: a C++ HNSW implementation from the author of the paper

Datasets
 * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, euclidean distance
 * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, euclidean distance

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                 Index time (sec)
LuceneVectorsOnly              14.93
LuceneHnsw                   3191.16
LuceneHnswNoForceMerge       1194.31
hnswlib                       311.09
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                  Index time (sec)
LuceneVectorsOnly              14.17
LuceneHnsw                   8940.41
LuceneHnswNoForceMerge       3623.68 
hnswlib                       587.23
{code}
We force merge to one segment to emulate a case where vectors aren't continually being indexed. In these situations, it seems likely users would force merge to optimize search speed: searching a single large graph is expected to be faster than searching several small ones serially. To see how long the force merge takes, we can subtract LuceneHnswNoForceMerge from LuceneHnsw.

The construction parameters match those in LUCENE-9937 and are optimized for search recall + QPS instead of index speed, as I figured this would be a common set-up.

Some observations:
 * In cases when segments are eventually force merged, we do a lot of extra work building intermediate graphs that are eventually merged away. This is a difficult problem, and one that's been raised in the past. As a simple step, I wonder if we should not build graphs for segments that are below a certain size. For sufficiently small segments, it could be a better trade-off to avoid building a graph and support nearest-neighbor search through a brute-force scan?
 * Indexing is slow compared to what we're used to for other formats, even if we disregard the extra work mentioned above. For sift-128-euclidean, building only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 min.
 * As a note, graph indexing uses ANN searches in order to add each new vector to the graph. So the slower search speed between Lucene and hnswlib may contribute to slower indexing.


> ann-benchmarks results for HNSW indexing
> ----------------------------------------
>
>                 Key: LUCENE-9941
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9941
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Julie Tibshirani
>            Priority: Minor
>
> This is a continuation of LUCENE-9937, but for HNSW index performance.
> Approaches
>  * LuceneVectorsOnly: a baseline that only indexes vectors
>  * LuceneHnsw: our HNSW implementation, with a force merge to one segment
>  * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
>  * hnswlib: a C++ HNSW implementation from the author of the paper
> Datasets
>  * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, comparing euclidean distance
>  * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, comparing cosine similarity
> *Results on sift-128-euclidean*
>  Parameters: M=16, efConstruction=500
> {code:java}
> Approach                 Index time (sec)
> LuceneVectorsOnly              14.93
> LuceneHnsw                   3191.16
> LuceneHnswNoForceMerge       1194.31
> hnswlib                       311.09
> {code}
> *Results on glove-100-angular*
>  Parameters: M=32, efConstruction=500
> {code:java}
> Approach                  Index time (sec)
> LuceneVectorsOnly              14.17
> LuceneHnsw                   8940.41
> LuceneHnswNoForceMerge       3623.68 
> hnswlib                       587.23
> {code}
> We force merge to one segment to emulate a case where vectors aren't continually being indexed. In these situations, it seems likely users would force merge to optimize search speed: searching a single large graph is expected to be faster than searching several small ones serially. To see how long the force merge takes, we can subtract LuceneHnswNoForceMerge from LuceneHnsw.
> The construction parameters match those in LUCENE-9937 and are optimized for search recall + QPS instead of index speed, as I figured this would be a common set-up.
> Some observations:
>  * In cases when segments are eventually force merged, we do a lot of extra work building intermediate graphs that are eventually merged away. This is a difficult problem, and one that's been raised in the past. As a simple step, I wonder if we should not build graphs for segments that are below a certain size. For sufficiently small segments, it could be a better trade-off to avoid building a graph and support nearest-neighbor search through a brute-force scan?
>  * Indexing is slow compared to what we're used to for other formats, even if we disregard the extra work mentioned above. For sift-128-euclidean, building only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 min.
>  * As a note, graph indexing uses ANN searches in order to add each new vector to the graph. So the slower search speed between Lucene and hnswlib may contribute to slower indexing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org