Posted to dev@lucene.apache.org by Jonathan Ellis <jb...@gmail.com> on 2023/08/05 13:46:40 UTC

Vamana greedy search variant

Hi all,

I put FINGER on pause to try out different graph construction methods.
Vamana (the in-memory component of DiskANN) looks interesting, especially
because it's a two-pass algorithm and the first pass is very similar to
"build L0 of hnsw."  So there is potentially a good fit: use HNSW in
Cassandra to index and query vectors "online," and when we flush to disk,
optimize the graph with Vamana for about half as much work as building it
from scratch.
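
Roughly, as I read the paper, the construction looks like this.  This is a
structural sketch only, not real code; the abstract methods (greedySearch,
robustPrune, and friends) are stand-ins for the actual routines:

import java.util.List;
import java.util.Set;

// Structural sketch of two-pass Vamana construction as I read the paper;
// the abstract methods stand in for the real routines.
abstract class VamanaBuildSketch {
  abstract void initializeRandomGraph(int size, int maxDegree);
  abstract Set<Integer> greedySearch(int entryPoint, float[] query, int beamWidth);
  abstract List<Integer> robustPrune(int node, Set<Integer> candidates,
                                     float alpha, int maxDegree);
  abstract void setNeighbors(int node, List<Integer> neighbors);
  abstract void addEdgeWithPrune(int from, int to, float alpha, int maxDegree);

  void build(List<float[]> vectors, int medoid, int maxDegree,
             int beamWidth, float targetAlpha) {
    initializeRandomGraph(vectors.size(), maxDegree);
    // Pass 1 runs with alpha = 1.0, which is close to "build L0 of hnsw";
    // pass 2 re-runs with the target alpha to add longer diverse edges.
    for (float alpha : new float[] {1.0f, targetAlpha}) {
      for (int node = 0; node < vectors.size(); node++) {
        // beam search toward this node; the visited set is the candidate pool
        Set<Integer> visited = greedySearch(medoid, vectors.get(node), beamWidth);
        // alpha-prune the pool down to at most maxDegree diverse neighbors
        List<Integer> neighbors = robustPrune(node, visited, alpha, maxDegree);
        setNeighbors(node, neighbors);
        // add back-edges, re-pruning any neighbor that overflows maxDegree
        for (int nbr : neighbors) {
          addEdgeWithPrune(nbr, node, alpha, maxDegree);
        }
      }
    }
  }
}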

Here's what I'm seeing on the nytimes-256 dataset, at two different beam
widths:

HNSW: top 100 recall 0.8166, build 42.98s, query 14.89s. 42148760 nodes
visited

Vamana@2.0: top 100 recall 0.9050, build 15.71s, query 13.03s. 39525880
nodes visited

Not bad, right?  But here's the thing: that's not using exactly the same
search algorithm.  When I point HnswGraphSearcher.searchLevel at the
Vamana graph, it's much worse:

Vamana@2.0: top 100 recall 0.7323, build 16.17s, query 9.45s. 44599250
nodes visited

The HNSW-style search is faster, mostly (?) because I've barely put any
time into optimizing the Vamana search method.  But its recall is much worse.

I've attached clips of the algorithm descriptions.  Setting aside a bit of
noise, the main difference is that HNSW says, "stop expanding nodes when
none of the candidates are closer to the query than the worst of the topk
found so far," while Vamana keeps the topk results and the candidates in
the *same* set [1] and says, "stop expanding nodes once everything in this
top-L set has been visited" (where L >= k).  In the results above I
manually set L to 4*k, since that makes the visited counts close to even.
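
To make the difference concrete, here's a minimal sketch of the two loops.
This is hand-written for illustration, not the Lucene code or the code in
my branch; neighborsOf and distance are stand-ins for the real graph and
vector access:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hand-written sketch of the two stopping conditions; not the Lucene code
// or the code in my branch.
abstract class StopConditionSketch {
  record Scored(int node, float dist) {}

  abstract int[] neighborsOf(int node);
  abstract float distance(int node, float[] query);

  // HNSW style: candidates and results are separate.  Stop when the best
  // remaining candidate is farther than the worst of the top-k so far.
  List<Scored> searchHnswStyle(int entry, float[] query, int topK) {
    PriorityQueue<Scored> candidates =
        new PriorityQueue<>(Comparator.comparingDouble(Scored::dist)); // closest first
    PriorityQueue<Scored> results =
        new PriorityQueue<>(Comparator.comparingDouble(Scored::dist).reversed()); // worst first
    Set<Integer> visited = new HashSet<>();
    Scored start = new Scored(entry, distance(entry, query));
    candidates.add(start);
    results.add(start);
    visited.add(entry);
    while (!candidates.isEmpty()) {
      Scored c = candidates.poll();
      if (results.size() >= topK && c.dist() > results.peek().dist()) {
        break; // nothing left can improve the top k
      }
      for (int nbr : neighborsOf(c.node())) {
        if (!visited.add(nbr)) continue;
        Scored s = new Scored(nbr, distance(nbr, query));
        if (results.size() < topK || s.dist() < results.peek().dist()) {
          candidates.add(s);
          results.add(s);
          if (results.size() > topK) results.poll();
        }
      }
    }
    return new ArrayList<>(results);
  }

  // Vamana style: results and candidates share one sorted set of size L.
  // Stop only when every node in the top-L set has been expanded.
  List<Scored> searchVamanaStyle(int entry, float[] query, int topK, int L) {
    List<Scored> working = new ArrayList<>(); // sorted ascending, size <= L
    Set<Integer> expanded = new HashSet<>();
    insertSorted(working, new Scored(entry, distance(entry, query)), L);
    while (true) {
      Scored next = null;
      for (Scored s : working) { // closest unexpanded node, if any
        if (!expanded.contains(s.node())) { next = s; break; }
      }
      if (next == null) break; // the whole top-L set has been expanded
      expanded.add(next.node());
      for (int nbr : neighborsOf(next.node())) {
        insertSorted(working, new Scored(nbr, distance(nbr, query)), L);
      }
    }
    return working.subList(0, Math.min(topK, working.size()));
  }

  private static void insertSorted(List<Scored> list, Scored s, int cap) {
    for (Scored t : list) {
      if (t.node() == s.node()) return; // already present
    }
    int i = 0;
    while (i < list.size() && list.get(i).dist() < s.dist()) i++;
    if (i == cap) return; // worse than everything we're keeping
    list.add(i, s);
    if (list.size() > cap) list.remove(list.size() - 1);
  }
}
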
Does anyone have any intuition for why the Vamana-style search does better
on the same graph while visiting about the same (in fact slightly fewer) nodes?

If you'd rather play with the code, rough cut is here:
https://github.com/jbellis/lucene/tree/concurrent-vamana, search code is
VamanaGraphBuilder.greedySearch.  Test harness branch at
https://github.com/jbellis/hnswdemo/tree/vamana.

[1] I implemented this as FixedNeighborArray.
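
In case it helps to picture it, a simplified sketch of the idea follows; it
is not the actual FixedNeighborArray, just parallel primitive arrays in the
spirit of Lucene's NeighborArray, and it ignores the visited bookkeeping
the search needs:

// Simplified sketch: parallel primitive arrays kept sorted by score,
// dropping the worst entry on overflow.  Not the actual class; the real
// one also has to track which entries have been visited.
class FixedNeighborSketch {
  private final int[] nodes;
  private final float[] scores; // sorted ascending; scores[0] is best
  private int size;

  FixedNeighborSketch(int capacity) {
    nodes = new int[capacity];
    scores = new float[capacity];
  }

  // insert in sorted position, dropping the worst entry if we're full
  void push(int node, float score) {
    if (size == nodes.length && score >= scores[size - 1]) {
      return; // worse than everything we're keeping
    }
    int i = (size == nodes.length) ? size - 1 : size++;
    while (i > 0 && scores[i - 1] > score) {
      nodes[i] = nodes[i - 1];
      scores[i] = scores[i - 1];
      i--;
    }
    nodes[i] = node;
    scores[i] = score;
  }
}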

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Vamana greedy search variant

Posted by Jonathan Ellis <jb...@gmail.com>.
Hi Jim, thanks for taking a look.

Yes, it's quite possible that replacing HNSW entirely with
FreshDiskANN-style Vamana would make sense.  I started with this approach
in part to better understand how much difference a Vamana pass makes
compared to just using the HNSW level 0 as-is.  (It's still unclear,
because the difference in search algorithm quality complicates things.)

I'm not sure what property you're thinking of re efficient merging.
AFAICT the DiskANN algorithms still insert one node at a time, and the
alpha-pruning is basically a generalization of the HNSW diversity
heuristic, so if there were a way to merge multiple Vamana graphs
efficiently, you should be able to do the same thing level-at-a-time with
HNSW.  (A sketch of the alpha-pruning I mean is below.)
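
Concretely, the alpha-pruning is roughly this; my paraphrase of the paper's
RobustPrune, not actual code:

import java.util.ArrayList;
import java.util.List;

// Paraphrase of DiskANN's RobustPrune (alpha-pruning).  With alpha = 1.0
// this is essentially the HNSW diversity heuristic.
class AlphaPruneSketch {
  static double dist(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // candidates must be sorted by ascending distance to target
  static List<Integer> robustPrune(float[] target, List<Integer> candidates,
                                   List<float[]> vectors, float alpha, int maxDegree) {
    List<Integer> kept = new ArrayList<>();
    for (int c : candidates) {
      boolean diverse = true;
      for (int k : kept) {
        // drop c when an already-kept neighbor covers it: k is so much
        // closer to c than c is to the target that c adds little
        if (alpha * dist(vectors.get(k), vectors.get(c))
            <= dist(target, vectors.get(c))) {
          diverse = false;
          break;
        }
      }
      if (diverse) kept.add(c);
      if (kept.size() == maxDegree) break;
    }
    return kept;
  }
}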

I'm using the low-level HNSW code to build Cassandra indexes [1] [2]; I
was pleased to find that Lucene is useful beyond Elastic and Solr.  I've
opened a PR to contribute my concurrent HNSW index here:
https://github.com/apache/lucene/pull/12421

[1]
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
[2] https://www.datastax.com/blog/vector-search-is-generally-available

On Sat, Aug 5, 2023 at 10:08 AM jim ferenczi <ji...@gmail.com> wrote:

> [snip]

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Vamana greedy search variant

Posted by jim ferenczi <ji...@gmail.com>.
Hi Jonathan,

Could you provide further clarification on your goal? The current
description is unclear.

Why construct an HNSW graph only to 'optimize' it into a Vamana graph? Why
not directly build a Vamana graph? This paper
<https://arxiv.org/pdf/2105.09613.pdf> provides guidance for streaming
graph construction.

An intriguing feature of the Vamana graph is its potential for more
efficient merging of multiple graphs compared to rebuilding the entire HNSW
graph for each merge. While I'm unsure of the process, it seems simpler to
work within a single level.

I'm also concerned about using HNSW code beyond the index writer. Flushing
to disk is integral to the index codec. It might be advisable to employ a
standard index writer rather than extracting code for external use,
potentially disrupting the indexing chain.

Best regards,
Jim

On Sat, 5 Aug 2023 at 22:47, Jonathan Ellis <jb...@gmail.com> wrote:

> [snip]