You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@cassandra.apache.org by Jonathan Ellis <jb...@gmail.com> on 2023/05/09 02:20:56 UTC

CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

Hi all,

Following the recent discussion threads, I would like to propose CEP-30 to
add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached
Indexes (SAI) to Apache Cassandra.

The primary goal of this proposal is to implement ANN vector search
capabilities, making Cassandra more useful to AI developers and
organizations managing large datasets that can benefit from fast similarity
search.

The implementation will leverage Lucene's Hierarchical Navigable Small
World (HNSW) library and introduce a new CQL data type for vector
embeddings, a new SAI index for ANN search functionality, and a new CQL
operator for performing ANN search queries.

We are targeting the 5.0 release for this feature, in conjunction with the
release of SAI. The proposed changes will maintain compatibility with
existing Cassandra functionality and compose well with the already-approved
SAI features.

Please find the full CEP document here:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

Posted by David Capwell <dc...@apple.com>.
Thanks for the update, LGTM

> On May 17, 2023, at 5:35 AM, Jasonstack Zhao Yang <ja...@gmail.com> wrote:
> 
> Hi,
> 
> I have updated the CEP with some details about distributed queries in the Approach section.
> 
> David:
> 
> > given results have a real ranking, the current 2i logic may yield incorrect results
> 
> C* internal iterators are all in primary key order. So we need two in-memory top-k filters, one at replica side and one at coordinator side, to make sure the returned rows are actually top-k but still primary key order.
> 
> > if 1 of the queries fails and can’t fall back to peers… does the query fail (I assume so)
> 
> yes, it will fail. we can make it pass if lower recall is acceptable.
> 
> Caleb:
> 
> > With smaller clusters or use-cases that are extremely write-heavy/read-light, it's possible that the full scatter/gather won't be too onerous, especially w/ a few small tweaks (on top of a non-vnode cluster)
> 
> You are right. Smaller cluster would definitely requires less coordinator memory to cache all required replicas' responses.
> 
> 
> Jeremy:
> 
> >  With SAI, can you have partial results?  When you have a query that is non-key based, you need to have full token range coverage of the results.  If that isn't possible, will Vector Search/SAI return partial results?
> 
> No partial result allowed. Query will failed with unavailability exception if some required token range is not available. For ANN search, users might be willing to have lower recall (partial results) with higher availability.
> 
> >  First, how is ordering/scoring done?
> > Each replica returns back to the coordinator a sorted set of results and the coordinator will have to see all of the results globally in order to do a global ordering.  You can't know what the top result is unless you've seen everything.  As to the scoring, I'm not sure how that will get calculated.
> 
> The results will be top-k but still in primary key order. Scores are computed based on vector similarly function.
> 
> Top-K search need two top-k filter as described in CEP. 
> 
> > Second, if I am ordering the results like for a Vector Search and I want to have the top 1 result.  How is the scoring done and what happens if there are 20 that have the same score?  How will the coordinator decide which 1 is returned out of 20?
> 
> It will be the row with smaller primary key order.
> 
> On Wed, 10 May 2023 at 05:39, Jeremy Hanna <jeremy.hanna1234@gmail.com <ma...@gmail.com>> wrote:
>> Just wanted to add that I don't have any special knowledge of CEP-30 beyond what Jonathan posted and just trying to help clarify and answer questions as I can with some knowledge and experience from DSE Search and SAI.  Thanks to Caleb for helping validate some things as well.  And to be clear about partial results - the default with DSE Search at least is to fail a query if it can't get the full token range coverage.  However there is an option to allow for shards being unavailable and return partial results.
>> 
>>> On May 9, 2023, at 3:38 PM, Jeremy Hanna <Jeremy.Hanna1234@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> I talked to David and some others in slack to hopefully clarify:
>>> 
>>> With SAI, can you have partial results?  When you have a query that is non-key based, you need to have full token range coverage of the results.  If that isn't possible, will Vector Search/SAI return partial results?
>>> 
>>> Anything can happen in the implementation, but for scoring, it may not make sense to return partial results because it's misleading.  For non-global queries, it could or couldn't return partial results depending on implementation/configuration.  In DSE you could have partial results depending on the options.   However I couldn't find partial results defined in CEP-7 or CEP-30.
>>> 
>>> The other questions are about scoring.
>>> 
>>> First, how is ordering/scoring done?
>>> 
>>> Each replica returns back to the coordinator a sorted set of results and the coordinator will have to see all of the results globally in order to do a global ordering.  You can't know what the top result is unless you've seen everything.  As to the scoring, I'm not sure how that will get calculated.
>>> 
>>> Second, if I am ordering the results like for a Vector Search and I want to have the top 1 result.  How is the scoring done and what happens if there are 20 that have the same score?  How will the coordinator decide which 1 is returned out of 20?
>>> 
>>> It returns results in token/partition and then clustering order.
>>> 
>>>> On May 9, 2023, at 2:53 PM, Caleb Rackliffe <calebrackliffe@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> Anyone on this ML who still remembers DSE Search (or has experience w/ Elastic or SolrCloud) probably also knows that there are some significant pieces of an optimized scatter/gather apparatus for IR (even without sorting, which also doesn't exist yet) that do not exist in C* or it's range query system (which SAI and all other 2i implementations use). SAI, like all C* 2i implementations, is still a local index, and as that is the case, anything built on it will perform best in partition-scoped (at least on the read side) use-cases. (On the bright side, the project is moving toward larger partitions being a possibility.) With smaller clusters or use-cases that are extremely write-heavy/read-light, it's possible that the full scatter/gather won't be too onerous, especially w/ a few small tweaks (on top of a non-vnode cluster) to a.) keep fanout minimal and b.) keep range/index queries to a single pass to minimize latency.
>>>> 
>>>> Whatever we do, we just need to avoid a situation down the road where users don't understand these nuances and hit a wall where they try to use this in a way that is fundamentally incompatible w/ the way the database scales/works. (I've done my best to call this out in all discussions around SAI over time, and there may even end up being further guardrails put in place to make it even harder to misuse it...but I digress.)
>>>> 
>>>> Having said all that, I don't fundamentally have a problem w/ the proposal.
>>>> 
>>>> On Tue, May 9, 2023 at 2:11 PM Benedict <benedict@apache.org <ma...@apache.org>> wrote:
>>>>> HNSW can in principle be made into a distributed index. But that would be quite a different paradigm to SAI.
>>>>> 
>>>>>> On 9 May 2023, at 19:30, Patrick McFadin <pmcfadin@gmail.com <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> 
>>>>>> Under the goals section, there is this line:
>>>>>> 
>>>>>> Scatter/gather across replicas, combining topK from each to get global topK.
>>>>>> 
>>>>>> But what I'm hearing is, exactly how will that happen? Maybe this is an SAI question too. How is that verified in SAI?
>>>>>> 
>>>>>> On Tue, May 9, 2023 at 11:07 AM David Capwell <dcapwell@apple.com <ma...@apple.com>> wrote:
>>>>>>> Approach section doesn’t go over how this will handle cross replica search, this would be good to flesh out… given results have a real ranking, the current 2i logic may yield incorrect results… so would think we need num_ranges / rf queries in the best case, with some new capability to sort the results?  If my assumption is correct, then how errors are handled should also be fleshed out… Example: 1k cluster without vnode and RF=3, so 333 queries fanned out to match, then coordinator needs to sort… if 1 of the queries fails and can’t fall back to peers… does the query fail (I assume so)?
>>>>>>> 
>>>>>>>> On May 8, 2023, at 7:20 PM, Jonathan Ellis <jbellis@gmail.com <ma...@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> Following the recent discussion threads, I would like to propose CEP-30 to add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached Indexes (SAI) to Apache Cassandra.
>>>>>>>> 
>>>>>>>> The primary goal of this proposal is to implement ANN vector search capabilities, making Cassandra more useful to AI developers and organizations managing large datasets that can benefit from fast similarity search.
>>>>>>>> 
>>>>>>>> The implementation will leverage Lucene's Hierarchical Navigable Small World (HNSW) library and introduce a new CQL data type for vector embeddings, a new SAI index for ANN search functionality, and a new CQL operator for performing ANN search queries.
>>>>>>>> 
>>>>>>>> We are targeting the 5.0 release for this feature, in conjunction with the release of SAI. The proposed changes will maintain compatibility with existing Cassandra functionality and compose well with the already-approved SAI features.
>>>>>>>> 
>>>>>>>> Please find the full CEP document here: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>>>>>>>> 
>>>>>>>> -- 
>>>>>>>> Jonathan Ellis
>>>>>>>> co-founder, http://www.datastax.com <http://www.datastax.com/>
>>>>>>>> @spyced
>>>>>>> 
>>> 
>> 


Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

Posted by Jasonstack Zhao Yang <ja...@gmail.com>.
Hi,

I have updated the CEP with some details about distributed queries in the
*Approach* section.

David:

> given results have a real ranking, the current 2i logic may yield
incorrect results

C* internal iterators are all in primary key order. So we need two
in-memory top-k filters, one at replica side and one at coordinator side,
to make sure the returned rows are actually top-k but still primary key
order.

> if 1 of the queries fails and can’t fall back to peers… does the query
fail (I assume so)

yes, it will fail. we can make it pass if lower recall is acceptable.

Caleb:

> With smaller clusters or use-cases that are extremely
write-heavy/read-light, it's possible that the full scatter/gather won't be
too onerous, especially w/ a few small tweaks (on top of a non-vnode
cluster)

You are right. Smaller cluster would definitely requires less coordinator
memory to cache all required replicas' responses.


Jeremy:

>  With SAI, can you have partial results?  When you have a query that is
non-key based, you need to have full token range coverage of the results.
If that isn't possible, will Vector Search/SAI return partial results?

No partial result allowed. Query will failed with unavailability exception
if some required token range is not available. For ANN search, users might
be willing to have lower recall (partial results) with higher availability.

>  First, how is ordering/scoring done?
> Each replica returns back to the coordinator a sorted set of results and
the coordinator will have to see all of the results globally in order to do
a global ordering.  You can't know what the top result is unless you've
seen everything.  As to the scoring, I'm not sure how that will get
calculated.

The results will be top-k but still in primary key order. Scores are
computed based on vector similarly function.

Top-K search need two top-k filter as described in CEP.

> Second, if I am ordering the results like for a Vector Search and I want
to have the top 1 result.  How is the scoring done and what happens if
there are 20 that have the same score?  How will the coordinator decide
which 1 is returned out of 20?

It will be the row with smaller primary key order.

On Wed, 10 May 2023 at 05:39, Jeremy Hanna <je...@gmail.com>
wrote:

> Just wanted to add that I don't have any special knowledge of CEP-30
> beyond what Jonathan posted and just trying to help clarify and answer
> questions as I can with some knowledge and experience from DSE Search and
> SAI.  Thanks to Caleb for helping validate some things as well.  And to be
> clear about partial results - the default with DSE Search at least is to
> fail a query if it can't get the full token range coverage.  However there
> is an option to allow for shards being unavailable and return partial
> results.
>
> On May 9, 2023, at 3:38 PM, Jeremy Hanna <Je...@gmail.com>
> wrote:
>
> I talked to David and some others in slack to hopefully clarify:
>
> With SAI, can you have partial results?  When you have a query that is
> non-key based, you need to have full token range coverage of the results.
> If that isn't possible, will Vector Search/SAI return partial results?
>
> Anything can happen in the implementation, but for scoring, it may not
> make sense to return partial results because it's misleading.  For
> non-global queries, it could or couldn't return partial results depending
> on implementation/configuration.  In DSE you could have partial results
> depending on the options.   However I couldn't find partial results defined
> in CEP-7 or CEP-30.
>
> The other questions are about scoring.
>
> First, how is ordering/scoring done?
>
> Each replica returns back to the coordinator a sorted set of results and
> the coordinator will have to see all of the results globally in order to do
> a global ordering.  You can't know what the top result is unless you've
> seen everything.  As to the scoring, I'm not sure how that will get
> calculated.
>
> Second, if I am ordering the results like for a Vector Search and I want
> to have the top 1 result.  How is the scoring done and what happens if
> there are 20 that have the same score?  How will the coordinator decide
> which 1 is returned out of 20?
>
> It returns results in token/partition and then clustering order.
>
> On May 9, 2023, at 2:53 PM, Caleb Rackliffe <ca...@gmail.com>
> wrote:
>
> Anyone on this ML who still remembers DSE Search (or has experience w/
> Elastic or SolrCloud) probably also knows that there are some significant
> pieces of an optimized scatter/gather apparatus for IR (even without
> sorting, which also doesn't exist yet) that do not exist in C* or it's
> range query system (which SAI and all other 2i implementations use). SAI,
> like all C* 2i implementations, is still a local index, and as that is the
> case, anything built on it will perform best in partition-scoped (at least
> on the read side) use-cases. (On the bright side, the project is moving
> toward larger partitions being a possibility.) With smaller clusters or
> use-cases that are extremely write-heavy/read-light, it's possible that the
> full scatter/gather won't be too onerous, especially w/ a few small tweaks
> (on top of a non-vnode cluster) to a.) keep fanout minimal and b.) keep
> range/index queries to a single pass to minimize latency.
>
> Whatever we do, we just need to avoid a situation down the road where
> users don't understand these nuances and hit a wall where they try to use
> this in a way that is fundamentally incompatible w/ the way the database
> scales/works. (I've done my best to call this out in all discussions around
> SAI over time, and there may even end up being further guardrails put in
> place to make it even harder to misuse it...but I digress.)
>
> Having said all that, I don't fundamentally have a problem w/ the proposal.
>
> On Tue, May 9, 2023 at 2:11 PM Benedict <be...@apache.org> wrote:
>
>> HNSW can in principle be made into a distributed index. But that would be
>> quite a different paradigm to SAI.
>>
>> On 9 May 2023, at 19:30, Patrick McFadin <pm...@gmail.com> wrote:
>>
>> 
>> Under the goals section, there is this line:
>>
>>
>>    1. Scatter/gather across replicas, combining topK from each to get
>>    global topK.
>>
>>
>> But what I'm hearing is, exactly how will that happen? Maybe this is an
>> SAI question too. How is that verified in SAI?
>>
>> On Tue, May 9, 2023 at 11:07 AM David Capwell <dc...@apple.com> wrote:
>>
>>> Approach section doesn’t go over how this will handle cross replica
>>> search, this would be good to flesh out… given results have a real ranking,
>>> the current 2i logic may yield incorrect results… so would think we need
>>> num_ranges / rf queries in the best case, with some new capability to sort
>>> the results?  If my assumption is correct, then how errors are handled
>>> should also be fleshed out… Example: 1k cluster without vnode and RF=3, so
>>> 333 queries fanned out to match, then coordinator needs to sort… if 1 of
>>> the queries fails and can’t fall back to peers… does the query fail (I
>>> assume so)?
>>>
>>> On May 8, 2023, at 7:20 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> Following the recent discussion threads, I would like to propose CEP-30
>>> to add Approximate Nearest Neighbor (ANN) Vector Search via
>>> Storage-Attached Indexes (SAI) to Apache Cassandra.
>>>
>>> The primary goal of this proposal is to implement ANN vector search
>>> capabilities, making Cassandra more useful to AI developers and
>>> organizations managing large datasets that can benefit from fast similarity
>>> search.
>>>
>>> The implementation will leverage Lucene's Hierarchical Navigable Small
>>> World (HNSW) library and introduce a new CQL data type for vector
>>> embeddings, a new SAI index for ANN search functionality, and a new CQL
>>> operator for performing ANN search queries.
>>>
>>> We are targeting the 5.0 release for this feature, in conjunction with
>>> the release of SAI. The proposed changes will maintain compatibility with
>>> existing Cassandra functionality and compose well with the already-approved
>>> SAI features.
>>>
>>> Please find the full CEP document here:
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>>>
>>>
>>>
>
>

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

Posted by Jeremy Hanna <je...@gmail.com>.
Just wanted to add that I don't have any special knowledge of CEP-30 beyond what Jonathan posted and just trying to help clarify and answer questions as I can with some knowledge and experience from DSE Search and SAI.  Thanks to Caleb for helping validate some things as well.  And to be clear about partial results - the default with DSE Search at least is to fail a query if it can't get the full token range coverage.  However there is an option to allow for shards being unavailable and return partial results.

> On May 9, 2023, at 3:38 PM, Jeremy Hanna <Je...@gmail.com> wrote:
> 
> I talked to David and some others in slack to hopefully clarify:
> 
> With SAI, can you have partial results?  When you have a query that is non-key based, you need to have full token range coverage of the results.  If that isn't possible, will Vector Search/SAI return partial results?
> 
> Anything can happen in the implementation, but for scoring, it may not make sense to return partial results because it's misleading.  For non-global queries, it could or couldn't return partial results depending on implementation/configuration.  In DSE you could have partial results depending on the options.   However I couldn't find partial results defined in CEP-7 or CEP-30.
> 
> The other questions are about scoring.
> 
> First, how is ordering/scoring done?
> 
> Each replica returns back to the coordinator a sorted set of results and the coordinator will have to see all of the results globally in order to do a global ordering.  You can't know what the top result is unless you've seen everything.  As to the scoring, I'm not sure how that will get calculated.
> 
> Second, if I am ordering the results like for a Vector Search and I want to have the top 1 result.  How is the scoring done and what happens if there are 20 that have the same score?  How will the coordinator decide which 1 is returned out of 20?
> 
> It returns results in token/partition and then clustering order.
> 
>> On May 9, 2023, at 2:53 PM, Caleb Rackliffe <ca...@gmail.com> wrote:
>> 
>> Anyone on this ML who still remembers DSE Search (or has experience w/ Elastic or SolrCloud) probably also knows that there are some significant pieces of an optimized scatter/gather apparatus for IR (even without sorting, which also doesn't exist yet) that do not exist in C* or it's range query system (which SAI and all other 2i implementations use). SAI, like all C* 2i implementations, is still a local index, and as that is the case, anything built on it will perform best in partition-scoped (at least on the read side) use-cases. (On the bright side, the project is moving toward larger partitions being a possibility.) With smaller clusters or use-cases that are extremely write-heavy/read-light, it's possible that the full scatter/gather won't be too onerous, especially w/ a few small tweaks (on top of a non-vnode cluster) to a.) keep fanout minimal and b.) keep range/index queries to a single pass to minimize latency.
>> 
>> Whatever we do, we just need to avoid a situation down the road where users don't understand these nuances and hit a wall where they try to use this in a way that is fundamentally incompatible w/ the way the database scales/works. (I've done my best to call this out in all discussions around SAI over time, and there may even end up being further guardrails put in place to make it even harder to misuse it...but I digress.)
>> 
>> Having said all that, I don't fundamentally have a problem w/ the proposal.
>> 
>> On Tue, May 9, 2023 at 2:11 PM Benedict <benedict@apache.org <ma...@apache.org>> wrote:
>>> HNSW can in principle be made into a distributed index. But that would be quite a different paradigm to SAI.
>>> 
>>>> On 9 May 2023, at 19:30, Patrick McFadin <pmcfadin@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> 
>>>> Under the goals section, there is this line:
>>>> 
>>>> Scatter/gather across replicas, combining topK from each to get global topK.
>>>> 
>>>> But what I'm hearing is, exactly how will that happen? Maybe this is an SAI question too. How is that verified in SAI?
>>>> 
>>>> On Tue, May 9, 2023 at 11:07 AM David Capwell <dcapwell@apple.com <ma...@apple.com>> wrote:
>>>>> Approach section doesn’t go over how this will handle cross replica search, this would be good to flesh out… given results have a real ranking, the current 2i logic may yield incorrect results… so would think we need num_ranges / rf queries in the best case, with some new capability to sort the results?  If my assumption is correct, then how errors are handled should also be fleshed out… Example: 1k cluster without vnode and RF=3, so 333 queries fanned out to match, then coordinator needs to sort… if 1 of the queries fails and can’t fall back to peers… does the query fail (I assume so)?
>>>>> 
>>>>>> On May 8, 2023, at 7:20 PM, Jonathan Ellis <jbellis@gmail.com <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> Following the recent discussion threads, I would like to propose CEP-30 to add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached Indexes (SAI) to Apache Cassandra.
>>>>>> 
>>>>>> The primary goal of this proposal is to implement ANN vector search capabilities, making Cassandra more useful to AI developers and organizations managing large datasets that can benefit from fast similarity search.
>>>>>> 
>>>>>> The implementation will leverage Lucene's Hierarchical Navigable Small World (HNSW) library and introduce a new CQL data type for vector embeddings, a new SAI index for ANN search functionality, and a new CQL operator for performing ANN search queries.
>>>>>> 
>>>>>> We are targeting the 5.0 release for this feature, in conjunction with the release of SAI. The proposed changes will maintain compatibility with existing Cassandra functionality and compose well with the already-approved SAI features.
>>>>>> 
>>>>>> Please find the full CEP document here: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>>>>>> 
>>>>>> -- 
>>>>>> Jonathan Ellis
>>>>>> co-founder, http://www.datastax.com <http://www.datastax.com/>
>>>>>> @spyced
>>>>> 
> 


Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

Posted by Jeremy Hanna <je...@gmail.com>.
I talked to David and some others in slack to hopefully clarify:

With SAI, can you have partial results?  When you have a query that is non-key based, you need to have full token range coverage of the results.  If that isn't possible, will Vector Search/SAI return partial results?

Anything can happen in the implementation, but for scoring, it may not make sense to return partial results because it's misleading.  For non-global queries, it could or couldn't return partial results depending on implementation/configuration.  In DSE you could have partial results depending on the options.   However I couldn't find partial results defined in CEP-7 or CEP-30.

The other questions are about scoring.

First, how is ordering/scoring done?

Each replica returns back to the coordinator a sorted set of results and the coordinator will have to see all of the results globally in order to do a global ordering.  You can't know what the top result is unless you've seen everything.  As to the scoring, I'm not sure how that will get calculated.

Second, if I am ordering the results like for a Vector Search and I want to have the top 1 result.  How is the scoring done and what happens if there are 20 that have the same score?  How will the coordinator decide which 1 is returned out of 20?

It returns results in token/partition and then clustering order.

> On May 9, 2023, at 2:53 PM, Caleb Rackliffe <ca...@gmail.com> wrote:
> 
> Anyone on this ML who still remembers DSE Search (or has experience w/ Elastic or SolrCloud) probably also knows that there are some significant pieces of an optimized scatter/gather apparatus for IR (even without sorting, which also doesn't exist yet) that do not exist in C* or it's range query system (which SAI and all other 2i implementations use). SAI, like all C* 2i implementations, is still a local index, and as that is the case, anything built on it will perform best in partition-scoped (at least on the read side) use-cases. (On the bright side, the project is moving toward larger partitions being a possibility.) With smaller clusters or use-cases that are extremely write-heavy/read-light, it's possible that the full scatter/gather won't be too onerous, especially w/ a few small tweaks (on top of a non-vnode cluster) to a.) keep fanout minimal and b.) keep range/index queries to a single pass to minimize latency.
> 
> Whatever we do, we just need to avoid a situation down the road where users don't understand these nuances and hit a wall where they try to use this in a way that is fundamentally incompatible w/ the way the database scales/works. (I've done my best to call this out in all discussions around SAI over time, and there may even end up being further guardrails put in place to make it even harder to misuse it...but I digress.)
> 
> Having said all that, I don't fundamentally have a problem w/ the proposal.
> 
> On Tue, May 9, 2023 at 2:11 PM Benedict <benedict@apache.org <ma...@apache.org>> wrote:
>> HNSW can in principle be made into a distributed index. But that would be quite a different paradigm to SAI.
>> 
>>> On 9 May 2023, at 19:30, Patrick McFadin <pmcfadin@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> 
>>> Under the goals section, there is this line:
>>> 
>>> Scatter/gather across replicas, combining topK from each to get global topK.
>>> 
>>> But what I'm hearing is, exactly how will that happen? Maybe this is an SAI question too. How is that verified in SAI?
>>> 
>>> On Tue, May 9, 2023 at 11:07 AM David Capwell <dcapwell@apple.com <ma...@apple.com>> wrote:
>>>> Approach section doesn’t go over how this will handle cross replica search, this would be good to flesh out… given results have a real ranking, the current 2i logic may yield incorrect results… so would think we need num_ranges / rf queries in the best case, with some new capability to sort the results?  If my assumption is correct, then how errors are handled should also be fleshed out… Example: 1k cluster without vnode and RF=3, so 333 queries fanned out to match, then coordinator needs to sort… if 1 of the queries fails and can’t fall back to peers… does the query fail (I assume so)?
>>>> 
>>>>> On May 8, 2023, at 7:20 PM, Jonathan Ellis <jbellis@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> Following the recent discussion threads, I would like to propose CEP-30 to add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached Indexes (SAI) to Apache Cassandra.
>>>>> 
>>>>> The primary goal of this proposal is to implement ANN vector search capabilities, making Cassandra more useful to AI developers and organizations managing large datasets that can benefit from fast similarity search.
>>>>> 
>>>>> The implementation will leverage Lucene's Hierarchical Navigable Small World (HNSW) library and introduce a new CQL data type for vector embeddings, a new SAI index for ANN search functionality, and a new CQL operator for performing ANN search queries.
>>>>> 
>>>>> We are targeting the 5.0 release for this feature, in conjunction with the release of SAI. The proposed changes will maintain compatibility with existing Cassandra functionality and compose well with the already-approved SAI features.
>>>>> 
>>>>> Please find the full CEP document here: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>>>>> 
>>>>> -- 
>>>>> Jonathan Ellis
>>>>> co-founder, http://www.datastax.com <http://www.datastax.com/>
>>>>> @spyced
>>>> 


Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

Posted by Caleb Rackliffe <ca...@gmail.com>.
Anyone on this ML who still remembers DSE Search (or has experience w/
Elastic or SolrCloud) probably also knows that there are some significant
pieces of an optimized scatter/gather apparatus for IR (even without
sorting, which also doesn't exist yet) that do not exist in C* or it's
range query system (which SAI and all other 2i implementations use). SAI,
like all C* 2i implementations, is still a local index, and as that is the
case, anything built on it will perform best in partition-scoped (at least
on the read side) use-cases. (On the bright side, the project is moving
toward larger partitions being a possibility.) With smaller clusters or
use-cases that are extremely write-heavy/read-light, it's possible that the
full scatter/gather won't be too onerous, especially w/ a few small tweaks
(on top of a non-vnode cluster) to a.) keep fanout minimal and b.) keep
range/index queries to a single pass to minimize latency.

Whatever we do, we just need to avoid a situation down the road where users
don't understand these nuances and hit a wall where they try to use this in
a way that is fundamentally incompatible w/ the way the database
scales/works. (I've done my best to call this out in all discussions around
SAI over time, and there may even end up being further guardrails put in
place to make it even harder to misuse it...but I digress.)

Having said all that, I don't fundamentally have a problem w/ the proposal.

On Tue, May 9, 2023 at 2:11 PM Benedict <be...@apache.org> wrote:

> HNSW can in principle be made into a distributed index. But that would be
> quite a different paradigm to SAI.
>
> On 9 May 2023, at 19:30, Patrick McFadin <pm...@gmail.com> wrote:
>
> 
> Under the goals section, there is this line:
>
>
>    1. Scatter/gather across replicas, combining topK from each to get
>    global topK.
>
>
> But what I'm hearing is, exactly how will that happen? Maybe this is an
> SAI question too. How is that verified in SAI?
>
> On Tue, May 9, 2023 at 11:07 AM David Capwell <dc...@apple.com> wrote:
>
>> Approach section doesn’t go over how this will handle cross replica
>> search, this would be good to flesh out… given results have a real ranking,
>> the current 2i logic may yield incorrect results… so would think we need
>> num_ranges / rf queries in the best case, with some new capability to sort
>> the results?  If my assumption is correct, then how errors are handled
>> should also be fleshed out… Example: 1k cluster without vnode and RF=3, so
>> 333 queries fanned out to match, then coordinator needs to sort… if 1 of
>> the queries fails and can’t fall back to peers… does the query fail (I
>> assume so)?
>>
>> On May 8, 2023, at 7:20 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> Hi all,
>>
>> Following the recent discussion threads, I would like to propose CEP-30
>> to add Approximate Nearest Neighbor (ANN) Vector Search via
>> Storage-Attached Indexes (SAI) to Apache Cassandra.
>>
>> The primary goal of this proposal is to implement ANN vector search
>> capabilities, making Cassandra more useful to AI developers and
>> organizations managing large datasets that can benefit from fast similarity
>> search.
>>
>> The implementation will leverage Lucene's Hierarchical Navigable Small
>> World (HNSW) library and introduce a new CQL data type for vector
>> embeddings, a new SAI index for ANN search functionality, and a new CQL
>> operator for performing ANN search queries.
>>
>> We are targeting the 5.0 release for this feature, in conjunction with
>> the release of SAI. The proposed changes will maintain compatibility with
>> existing Cassandra functionality and compose well with the already-approved
>> SAI features.
>>
>> Please find the full CEP document here:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>>
>>
>>

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

Posted by Benedict <be...@apache.org>.
HNSW can in principle be made into a distributed index. But that would be
quite a different paradigm to SAI.

  

> On 9 May 2023, at 19:30, Patrick McFadin <pm...@gmail.com> wrote:  
>  
>

> 
>
> Under the goals section, there is this line:
>
>  
>
>
>   1. Scatter/gather across replicas, combining topK from each to get global
> topK.
>

>
>  
>
>
> But what I'm hearing is, exactly how will that happen? Maybe this is an SAI
> question too. How is that verified in SAI?  
>
>
>  
>
>
> On Tue, May 9, 2023 at 11:07 AM David Capwell
> <[dcapwell@apple.com](mailto:dcapwell@apple.com)> wrote:  
>
>

>> Approach section doesn’t go over how this will handle cross replica search,
this would be good to flesh out… given results have a real ranking, the
current 2i logic may yield incorrect results… so would think we need
num_ranges / rf queries in the best case, with some new capability to sort the
results?  If my assumption is correct, then how errors are handled should also
be fleshed out… Example: 1k cluster without vnode and RF=3, so 333 queries
fanned out to match, then coordinator needs to sort… if 1 of the queries fails
and can’t fall back to peers… does the query fail (I assume so)?  
>
>>

>>  
>
>>

>>> On May 8, 2023, at 7:20 PM, Jonathan Ellis
<[jbellis@gmail.com](mailto:jbellis@gmail.com)> wrote:

>>>

>>>  
>
>>>

>>> Hi all,  
>  
> Following the recent discussion threads, I would like to propose CEP-30 to
> add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached
> Indexes (SAI) to Apache Cassandra.  
>  
> The primary goal of this proposal is to implement ANN vector search
> capabilities, making Cassandra more useful to AI developers and
> organizations managing large datasets that can benefit from fast similarity
> search.  
>  
> The implementation will leverage Lucene's Hierarchical Navigable Small World
> (HNSW) library and introduce a new CQL data type for vector embeddings, a
> new SAI index for ANN search functionality, and a new CQL operator for
> performing ANN search queries.  
>  
> We are targeting the 5.0 release for this feature, in conjunction with the
> release of SAI. The proposed changes will maintain compatibility with
> existing Cassandra functionality and compose well with the already-approved
> SAI features.  
>  
> Please find the full CEP document here:
> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-
> Attached+Indexes>  
>  
> \--  
>
>>>

>>> Jonathan Ellis  
> co-founder, [http://www.datastax.com](http://www.datastax.com/)  
> @spyced  
>
>>

>>  
>


Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

Posted by Patrick McFadin <pm...@gmail.com>.
Under the goals section, there is this line:


   1. Scatter/gather across replicas, combining topK from each to get
   global topK.


But what I'm hearing is, exactly how will that happen? Maybe this is an SAI
question too. How is that verified in SAI?

On Tue, May 9, 2023 at 11:07 AM David Capwell <dc...@apple.com> wrote:

> Approach section doesn’t go over how this will handle cross replica
> search, this would be good to flesh out… given results have a real ranking,
> the current 2i logic may yield incorrect results… so would think we need
> num_ranges / rf queries in the best case, with some new capability to sort
> the results?  If my assumption is correct, then how errors are handled
> should also be fleshed out… Example: 1k cluster without vnode and RF=3, so
> 333 queries fanned out to match, then coordinator needs to sort… if 1 of
> the queries fails and can’t fall back to peers… does the query fail (I
> assume so)?
>
> On May 8, 2023, at 7:20 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>
> Hi all,
>
> Following the recent discussion threads, I would like to propose CEP-30 to
> add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached
> Indexes (SAI) to Apache Cassandra.
>
> The primary goal of this proposal is to implement ANN vector search
> capabilities, making Cassandra more useful to AI developers and
> organizations managing large datasets that can benefit from fast similarity
> search.
>
> The implementation will leverage Lucene's Hierarchical Navigable Small
> World (HNSW) library and introduce a new CQL data type for vector
> embeddings, a new SAI index for ANN search functionality, and a new CQL
> operator for performing ANN search queries.
>
> We are targeting the 5.0 release for this feature, in conjunction with the
> release of SAI. The proposed changes will maintain compatibility with
> existing Cassandra functionality and compose well with the already-approved
> SAI features.
>
> Please find the full CEP document here:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
>
>

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

Posted by David Capwell <dc...@apple.com>.
Approach section doesn’t go over how this will handle cross replica search, this would be good to flesh out… given results have a real ranking, the current 2i logic may yield incorrect results… so would think we need num_ranges / rf queries in the best case, with some new capability to sort the results?  If my assumption is correct, then how errors are handled should also be fleshed out… Example: 1k cluster without vnode and RF=3, so 333 queries fanned out to match, then coordinator needs to sort… if 1 of the queries fails and can’t fall back to peers… does the query fail (I assume so)?

> On May 8, 2023, at 7:20 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> 
> Hi all,
> 
> Following the recent discussion threads, I would like to propose CEP-30 to add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached Indexes (SAI) to Apache Cassandra.
> 
> The primary goal of this proposal is to implement ANN vector search capabilities, making Cassandra more useful to AI developers and organizations managing large datasets that can benefit from fast similarity search.
> 
> The implementation will leverage Lucene's Hierarchical Navigable Small World (HNSW) library and introduce a new CQL data type for vector embeddings, a new SAI index for ANN search functionality, and a new CQL operator for performing ANN search queries.
> 
> We are targeting the 5.0 release for this feature, in conjunction with the release of SAI. The proposed changes will maintain compatibility with existing Cassandra functionality and compose well with the already-approved SAI features.
> 
> Please find the full CEP document here: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
> 
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com <http://www.datastax.com/>
> @spyced