Posted to issues@lucene.apache.org by "Julie Tibshirani (Jira)" <ji...@apache.org> on 2022/02/04 01:54:00 UTC

[jira] [Updated] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

     [ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julie Tibshirani updated LUCENE-10404:
--------------------------------------
    Description: 
While searching each layer, HNSW tracks the nodes it has already visited using a BitSet. We could look into using something like IntHashSet instead. I tried out the idea quickly by switching to IntIntHashMap (which has already been copied from hppc) and saw an improvement in index performance. 

*Baseline:* 760896 msec to write vectors
*Using IntIntHashMap:* 733017 msec to write vectors

I noticed search performance actually got a little worse with the change -- that is something to look into.

For background, note that HNSW can visit a lot of nodes. For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1,000 to 15,000 docs depending on the recall. This number can increase when searching with deleted docs, especially if you hit a "pathological" case where the deleted docs happen to be closest to the query vector.
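The tradeoff described above can be sketched in a few lines. This is a minimal illustration, not Lucene's actual HnswGraphSearcher code; the class and method names below are hypothetical. The point is that a BitSet sized to the graph allocates memory proportional to the total node count, while a hash set grows only with the (often much smaller) number of visited nodes -- which is why hppc-style primitive sets like IntHashSet are attractive here.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/**
 * Minimal sketch of the two "visited" strategies discussed above.
 * Not Lucene code; names here are hypothetical.
 */
public class VisitedSetSketch {

  /** BitSet approach: ~maxNodeId/8 bytes allocated up front, however few nodes are visited. */
  static int countNewVisitsBitSet(int[] candidates, int maxNodeId) {
    BitSet visited = new BitSet(maxNodeId);
    int newVisits = 0;
    for (int node : candidates) {
      if (!visited.get(node)) {
        visited.set(node);
        newVisits++;
      }
    }
    return newVisits;
  }

  /** Hash-set approach: memory grows only with the distinct nodes actually visited.
   *  (HashSet boxes ints; a primitive IntHashSet from hppc avoids that overhead.) */
  static int countNewVisitsHashSet(int[] candidates) {
    Set<Integer> visited = new HashSet<>();
    int newVisits = 0;
    for (int node : candidates) {
      if (visited.add(node)) { // add() returns true only for previously unseen nodes
        newVisits++;
      }
    }
    return newVisits;
  }

  public static void main(String[] args) {
    // An HNSW search over ~1.2M docs may visit only ~1,000-15,000 nodes,
    // so the visited structure is very sparse relative to the graph size.
    int[] candidates = {5, 900_000, 5, 42, 900_000, 7};
    int maxNodeId = 1_200_000;
    System.out.println(countNewVisitsBitSet(candidates, maxNodeId)); // 4 distinct nodes
    System.out.println(countNewVisitsHashSet(candidates));           // 4 distinct nodes
  }
}
```

The hash set trades the BitSet's O(1) unconditional get/set for hashing cost per lookup, which may explain the slightly worse search times observed above even as indexing improved.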

  was:
While searching each layer, HNSW tracks the nodes it has already visited using a BitSet. We could look into using something like IntHashSet instead. I tried out the idea quickly by switching to IntIntHashMap (which has already been copied from hppc) and saw an improvement in index performance. 

**Baseline:** 760896 msec to write vectors
**Using IntIntHashMap:** 733017 msec to write vectors

I noticed search performance actually got a little bit worse with the change -- that is something to look into.

For background, it's good to be aware that HNSW can visit a lot of nodes. For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1000 - 15,000 docs depending on the recall. This number can increase when searching with deleted docs, especially if you hit a "pathological" case where the deleted docs happen to be closest to the query vector.


> Use hash set for visited nodes in HNSW search?
> ----------------------------------------------
>
>                 Key: LUCENE-10404
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10404
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org