Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2022/02/04 09:18:00 UTC

[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

    [ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486894#comment-17486894 ] 

Adrien Grand commented on LUCENE-10404:
---------------------------------------

{quote}For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1000 - 15,000 docs depending on the recall.
{quote}
So if some searches at index time need to visit ~15k docs, then we'd likely need a hash set with a backing array of 32,768 entries (the smallest power of two that is at least 2x the number of items in the hash set).
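
For illustration, a minimal sketch of that sizing arithmetic, assuming power-of-two tables and a maximum load factor of 0.5 (the method name is made up):

{code:java}
// Smallest power-of-two table that keeps the load factor at or below 0.5.
// tableSize(15_000) == 32_768, matching the numbers above.
static int tableSize(int expectedItems) {
  int minCapacity = expectedItems * 2;                 // load factor 0.5
  if (minCapacity <= 1) {
    return 1;
  }
  return Integer.highestOneBit(minCapacity - 1) << 1;  // next power of two >= minCapacity
}
{code}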

Since we'd never shrink the backing array, even when we only visit 1,000 nodes we would still have to clear all 32k entries of the backing array, so the {{IntHashSet#clear}} calls might not be free (though still cheaper than clearing a FixedBitSet sized to the whole graph).

This makes me wonder if we should consider a specialized hash set implementation for this use-case, e.g. one that uses a FixedBitSet to track used buckets instead of sentinel values (sentinel values being, I believe, what HPPC's IntHashSet does). This would make {{IntHashSet#add}} calls a bit more expensive (maybe?), but clearing the set would then only require clearing the BitSet that tracks used slots rather than the backing array, so we'd clear 32,768 bits = 4kB of data instead of 32,768*4 = 128kB.
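
To make the idea concrete, here is a minimal sketch of such a set, assuming a fixed power-of-two capacity and linear probing; the class name and all details are made up, and a plain long[] stands in for FixedBitSet to keep the sketch self-contained:

{code:java}
import java.util.Arrays;

// Hypothetical sketch: an open-addressing int set where a bitset marks the
// used buckets, so clear() only touches the bitset, not the key array.
// Assumes the table is sized at >= 2x the number of items, so probing
// always finds a free slot.
final class SlotTrackingIntHashSet {
  private final int[] keys;    // backing array, never cleared between searches
  private final long[] used;   // 1 bit per bucket: 4kB for 32,768 buckets
  private final int mask;

  SlotTrackingIntHashSet(int capacity) { // capacity must be a power of two
    keys = new int[capacity];
    used = new long[(capacity + 63) >>> 6];
    mask = capacity - 1;
  }

  /** Returns true if the key was not already present. */
  boolean add(int key) {
    int slot = mix(key) & mask;
    while (isUsed(slot)) {
      if (keys[slot] == key) {
        return false;
      }
      slot = (slot + 1) & mask; // linear probing
    }
    keys[slot] = key;
    used[slot >>> 6] |= 1L << slot; // the extra write that makes add() a bit more expensive
    return true;
  }

  /** The payoff: clear capacity/8 bytes of bits instead of capacity*4 bytes of keys. */
  void clear() {
    Arrays.fill(used, 0L);
  }

  private boolean isUsed(int slot) {
    return (used[slot >>> 6] & (1L << slot)) != 0;
  }

  private static int mix(int key) { // cheap avalanching hash
    int h = key * 0x9E3779B9;
    return h ^ (h >>> 16);
  }
}
{code}

For a 32,768-bucket table this trades one extra bitset read per probe and one extra write per insert for a clear() that touches 4kB instead of 128kB.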

> Use hash set for visited nodes in HNSW search?
> ----------------------------------------------
>
>                 Key: LUCENE-10404
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10404
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited using a BitSet. We could look into using something like IntHashSet instead. I tried out the idea quickly by switching to IntIntHashMap (which has already been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1000 - 15,000 docs depending on the recall. This number can increase when searching with deleted docs, especially if you hit a "pathological" case where the deleted docs happen to be closest to the query vector.
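
For readers less familiar with the search loop, a grossly simplified schematic of where the visited set sits; this is not Lucene's actual implementation, and the interfaces, the visit cap, and all names are stand-ins for illustration:

{code:java}
import java.util.PriorityQueue;
import java.util.Set;

// Grossly simplified per-layer HNSW search loop. Every neighbor of every
// expanded node probes `visited` once, which is why the visited-set
// implementation sits on the hot path.
final class LayerSearchSketch {

  interface Graph { int[] neighbors(int node); }
  interface Scorer { float distance(int node); } // distance from node to the query

  static final class Candidate implements Comparable<Candidate> {
    final int node;
    final float dist;
    Candidate(int node, float dist) { this.node = node; this.dist = dist; }
    @Override public int compareTo(Candidate o) { return Float.compare(dist, o.dist); }
  }

  /** Returns the closest node found, expanding at most maxVisits nodes. */
  static int search(Graph graph, Scorer scorer, int entryPoint, int maxVisits,
                    Set<Integer> visited) {
    PriorityQueue<Candidate> candidates = new PriorityQueue<>(); // nearest first
    visited.add(entryPoint);
    float bestDist = scorer.distance(entryPoint);
    int best = entryPoint;
    candidates.add(new Candidate(entryPoint, bestDist));
    int visits = 1;
    while (!candidates.isEmpty() && visits < maxVisits) {
      Candidate c = candidates.poll();
      for (int n : graph.neighbors(c.node)) {
        if (visited.add(n)) { // the visited-set probe this issue is about
          visits++;
          float d = scorer.distance(n);
          if (d < bestDist) {
            bestDist = d;
            best = n;
          }
          candidates.add(new Candidate(n, d));
        }
      }
    }
    return best;
  }
}
{code}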



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org