Posted to issues@lucene.apache.org by "Dmitry Kan (Jira)" <ji...@apache.org> on 2021/07/03 16:10:00 UTC

[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm

    [ https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17374070#comment-17374070 ] 

Dmitry Kan commented on LUCENE-9905:
------------------------------------

Usually, KNN refers to finding the true top K nearest neighbors of a query, so it is an exact search in that sense.
ANN stands for approximate nearest neighbors: the search is inexact and usually faster, trading accuracy for speed. HNSW is an ANN algorithm in this terminology. Hope this helps.
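
To make the distinction concrete, here is a minimal sketch of exact KNN as a brute-force scan. The class and method names are hypothetical and are not Lucene APIs; the point is only that every vector is scored, so the result is guaranteed to be the true top K. An ANN algorithm such as HNSW avoids this full scan and may therefore miss some true neighbors.

```java
import java.util.Arrays;

/** Hypothetical illustration of exact (brute-force) KNN; not part of Lucene. */
final class ExactKnnSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the indices of the k vectors closest to the query, nearest first. */
    static int[] exactTopK(float[][] vectors, float[] query, int k) {
        Integer[] ids = new Integer[vectors.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Every vector is compared against the query, so no neighbor can be missed.
        Arrays.sort(ids, (x, y) -> Float.compare(
                squaredDistance(vectors[x], query),
                squaredDistance(vectors[y], query)));
        int n = Math.min(k, ids.length);
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = ids[i];
        }
        return top;
    }
}
```

The cost grows linearly with the number of vectors (plus the sort), which is the cost that approximate methods trade accuracy to avoid.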

> Revise approach to specifying NN algorithm
> ------------------------------------------
>
>                 Key: LUCENE-9905
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9905
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: main (9.0)
>            Reporter: Julie Tibshirani
>            Priority: Blocker
>             Fix For: main (9.0)
>
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a particular nearest-neighbor search data structure and algorithm. This flexibility is important since NN search is a developing area and we'd like to be able to experiment and evolve the algorithm. Right now we only have one algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation is expected to handle multiple algorithms. Instead we could have one format implementation per algorithm. Our current implementation would be HNSW-specific like {{HnswVectorFormat}}, and to experiment with another algorithm you could create a new implementation like {{ClusterVectorFormat}}. This would be better aligned with the codec framework, and help avoid exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or parameters to be configured per-field (?)
> One note: the current HNSW-based format includes logic for storing a numeric vector per document, as well as constructing + storing an HNSW graph. When adding another implementation, it’d be nice to be able to reuse logic for reading/writing numeric vectors. I don’t think we need to design for this right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]
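
As a rough illustration of steps 3 and 4 of the proposal quoted above, the shape could look something like the sketch below. Every type here is a hypothetical stand-in written for this example, not the actual Lucene codec API, and the parameter values in the usage note are arbitrary. It assumes only what the description states: HNSW parameters (maxConn and beamWidth) become arguments of the HNSW-specific format, and a per-field wrapper chooses a format for each field.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a vector format; one implementation per NN algorithm. */
interface VectorFormatSketch {
    String describe();
}

/** HNSW-specific format: the algorithm parameters live here, not on the field type. */
final class HnswVectorFormatSketch implements VectorFormatSketch {
    final int maxConn;
    final int beamWidth;

    HnswVectorFormatSketch(int maxConn, int beamWidth) {
        this.maxConn = maxConn;
        this.beamWidth = beamWidth;
    }

    @Override
    public String describe() {
        return "hnsw(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
    }
}

/** Picks a format per field, falling back to a default for unconfigured fields. */
final class PerFieldVectorFormatSketch {
    private final Map<String, VectorFormatSketch> byField = new HashMap<>();
    private final VectorFormatSketch defaultFormat;

    PerFieldVectorFormatSketch(VectorFormatSketch defaultFormat) {
        this.defaultFormat = defaultFormat;
    }

    /** Registers a format for one field and returns this object for chaining. */
    PerFieldVectorFormatSketch withField(String field, VectorFormatSketch format) {
        byField.put(field, format);
        return this;
    }

    VectorFormatSketch formatFor(String field) {
        return byField.getOrDefault(field, defaultFormat);
    }
}
```

With this shape, `new PerFieldVectorFormatSketch(new HnswVectorFormatSketch(16, 100)).withField("title_vector", new HnswVectorFormatSketch(32, 200))` would give one field its own HNSW parameters, and a different algorithm could be added as another `VectorFormatSketch` implementation without touching the similarity function or the field type.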


