Posted to dev@lucene.apache.org by Jonathan Ellis <jb...@gmail.com> on 2023/05/09 17:25:32 UTC

Re: HNSW questions

I don't see anything to make sure vectors are unique in IndexingChain down
to FieldWriter, is that handled somewhere else?  Or is it just up to the
user to make sure no documents end up with duplicate vectors?

On Wed, Apr 19, 2023 at 5:07 AM Michael Sokolov <ms...@gmail.com> wrote:

> Oh identical vectors. Basically unsupported. If you create a large index
> filled with identical vectors it leads to pathological behavior. Seems to
> be a weakness in the algorithm. If you have any idea how to improve that,
> it would be welcome. But in real world scenarios, it doesn't seem to arise?
>
> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis <jb...@gmail.com> wrote:
>
>> Hi all, a couple of questions on how HNSW works:
>>
>> 1. What is driving the requirement for two copies of the input vectors?
>> It looks like the RAVV implementations do shallow copies, so the vector
>> from A is the same that would be returned by B.  What am I missing?
>>
>> 2. What is the intended behavior when adding identical vectors to an
>> HNSW?  It looks like when I supply 10 identical vectors, they all get added
>> to the graph, but when I search for the nearest neighbors, I only get one
>> of them in the result set.
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>>
>

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: HNSW questions

Posted by Michael Sokolov <ms...@gmail.com>.
Yes, it's up to the application. And it is definitely a pathological
case when it happens; see https://github.com/apache/lucene/issues/11626
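
As a rough illustration of what "up to the application" could look like,
here is a minimal sketch that groups documents by exact vector content
before anything is handed to the indexer. It uses only the JDK and no Lucene
APIs. The class name VectorDedup, the method groupIdentical, and the idea of
indexing one representative per group are assumptions made for this example,
not something Lucene provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical application-side helper; not part of Lucene.
public class VectorDedup {

    /** Wraps a float[] so it can serve as a hash key compared by content. */
    private static final class Key {
        private final float[] vector;

        Key(float[] vector) {
            this.vector = vector;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && Arrays.equals(vector, ((Key) o).vector);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(vector);
        }
    }

    /**
     * Groups the positions of vectors that are element-wise identical.
     * Groups are returned in the order their first member was seen.
     */
    public static List<List<Integer>> groupIdentical(List<float[]> vectors) {
        Map<Key, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < vectors.size(); i++) {
            groups.computeIfAbsent(new Key(vectors.get(i)), k -> new ArrayList<>()).add(i);
        }
        return new ArrayList<>(groups.values());
    }

    public static void main(String[] args) {
        List<float[]> vectors = Arrays.asList(
            new float[] {1f, 2f},
            new float[] {3f, 4f},
            new float[] {1f, 2f});
        // Positions 0 and 2 hold the same vector, so they share a group.
        System.out.println(groupIdentical(vectors)); // prints [[0, 2], [1]]
    }
}
```

An application could then index only the first member of each group and
keep its own mapping from that representative back to the other documents.
Note that this catches only bit-for-bit identical vectors (Arrays.equals
semantics); near-duplicates would still reach the graph.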

