Posted to issues@lucene.apache.org by "Xin-Chun Zhang (Jira)" <ji...@apache.org> on 2020/03/06 17:21:00 UTC

[jira] [Issue Comment Deleted] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

     [ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xin-Chun Zhang updated LUCENE-9136:
-----------------------------------
    Comment: was deleted

(was: Hi, [~jtibshirani], thanks for your suggestions!

??"I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms."??

In the previous implementation ([https://github.com/irvingzhang/lucene-solr/commit/eb5f79ea7a705595821f73f80a0c5752061869b2]), the cluster information is divided into two parts – meta (.ifi) and data (.ifd) – as shown in the following figure: each cluster's postings list is stored in the data file (.ifd) rather than kept on-heap. A major concern with this implementation is the read performance of the cluster data, since reads are very frequent during kNN search. I will test and check the performance.

!image-2020-02-16-15-05-02-451.png!
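To make the query path concrete, below is a minimal, self-contained sketch of an IVFFlat search over per-cluster posting lists. The class and method names are illustrative only (not the API in this patch), and it ranks all centroids by brute force, where a real implementation would use a heap:

```java
import java.util.*;

// Illustrative IVFFlat query: scan only the nprobe clusters whose centroids
// are closest to the query, then rank the candidate docIDs exactly.
public class IvfFlatSketch {
    static float l2(float[] a, float[] b) {
        float s = 0f;
        for (int i = 0; i < a.length; i++) { float d = a[i] - b[i]; s += d * d; }
        return s;
    }

    /** Returns the docIDs of the k nearest vectors, scanning only nprobe clusters. */
    static int[] search(float[] query, float[][] centroids, int[][] postings,
                        float[][] vectors, int nprobe, int k) {
        // 1. Rank centroids by distance to the query.
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(c -> l2(query, centroids[c])));

        // 2. Collect candidates from the nprobe closest clusters' posting lists.
        List<Integer> candidates = new ArrayList<>();
        for (int p = 0; p < Math.min(nprobe, order.length); p++) {
            for (int docId : postings[order[p]]) candidates.add(docId);
        }

        // 3. Exact re-ranking of the candidates ("Flat" part of IVFFlat).
        candidates.sort(Comparator.comparingDouble(d -> l2(query, vectors[d])));
        return candidates.subList(0, Math.min(k, candidates.size()))
                         .stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // Two toy 2-D clusters; docIDs index into `vectors`.
        float[][] vectors   = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        float[][] centroids = {{0, 0.5f}, {10, 10.5f}};
        int[][]   postings  = {{0, 1}, {2, 3}};
        int[] top = search(new float[]{9, 9}, centroids, postings, vectors, 1, 2);
        System.out.println(Arrays.toString(top)); // [2, 3]
    }
}
```

With nprobe = 1 only the second cluster is scanned, which is exactly why higher nprobe trades query time for recall.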

??"Because of this concern, it could be nice to include benchmarks for index time (in addition to QPS)..."??

Many thanks! I will check the links you mentioned and consider optimizing the clustering cost. In addition, more benchmarks will be added soon.

 
h2. *UPDATE – Feb. 24, 2020*

I have added a new implementation of the IVF index, marked as ***V2 under the package org.apache.lucene.codecs.lucene90. In the current implementation, the IVF index is divided into two files with the suffixes .ifi and .ifd, respectively. The .ifd file is read whenever cluster information is needed. The experiments were conducted on the sift1M dataset (test code: [https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/KnnIvfPerformTester.java]); detailed results are as follows,
 # add documents -- 3921 ms;
 # commit -- 3912286 ms (mainly spent on k-means training: 10 iterations, 4000 centroids, 512,000 vectors in total used for training);
 # R@100 search time and recall ratio are listed in the following table

 
||nprobe||avg. search time (ms)||recall ratio (%)||
|8|28.0755|44.154|
|16|27.1745|57.9945|
|32|32.986|71.7003|
|64|40.4082|83.50471|
|128|50.9569|92.07929|
|256|73.923|97.150894|
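For context on the commit time above, a back-of-envelope count shows why training dominates. This assumes a plain Lloyd's k-means with a brute-force assignment step (the actual trainer in the patch may differ), where each iteration compares every training vector against every centroid:

```java
// Rough cost model for the reported k-means configuration:
// 10 iterations x 4000 centroids x 512,000 training vectors.
public class KmeansCost {
    public static void main(String[] args) {
        long iterations = 10, centroids = 4_000, trainVectors = 512_000;
        long distanceEvals = iterations * centroids * trainVectors;
        System.out.println(distanceEvals); // 20480000000
    }
}
```

Over 20 billion 128-dimensional distance computations at commit time, which is consistent with training taking roughly three orders of magnitude longer than adding the documents.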

 Compared with the on-heap implementation of the IVF index, the query time increases significantly (22%~71%). Actually, the IVF index consists only of unique docIDs and will not take up too much memory. *There is a small open question about whether to keep the cluster information on-heap or not. I hope to hear more suggestions.*
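To put rough numbers on the on-heap footprint (assuming sift1M's 1M base vectors at 128 dimensions and the 4000 centroids used above; the exact on-heap layout in the patch may differ):

```java
// Back-of-envelope heap estimate for keeping the IVF cluster info in memory.
public class IvfMemoryEstimate {
    public static void main(String[] args) {
        // Posting lists hold one int docID per indexed vector.
        long docIdBytes = 1_000_000L * Integer.BYTES;     // 4,000,000 bytes (~4 MB)
        // Centroid vectors: 4000 centroids x 128 dims x 4-byte floats.
        long centroidBytes = 4_000L * 128 * Float.BYTES;  // 2,048,000 bytes (~2 MB)
        System.out.println(docIdBytes + " + " + centroidBytes + " bytes");
    }
}
```

On that estimate the whole cluster structure is a few megabytes, which supports the argument that keeping it on-heap is cheap relative to the measured 22%~71% query-time penalty of reading it from .ifd.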

 

 )

> Introduce IVFFlat to Lucene for ANN similarity search
> -----------------------------------------------------
>
>                 Key: LUCENE-9136
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9136
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Xin-Chun Zhang
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the machine learning space for decades, but it has drawn tremendous attention lately with the emergence of deep learning. The central problem of RL is to determine an optimal representation of the input data. By embedding the data into a high-dimensional vector, the vector retrieval (VR) method is then applied to search for the relevant items.
> With the rapid development of RL over the past few years, the technique has been used extensively in industry, from online advertising to computer vision and speech recognition. There exist many open source implementations of VR algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various choices for potential users. However, the aforementioned implementations are all written in C++ with no plans to support a Java interface, making them hard to integrate into Java projects or for those who are not familiar with C/C++  [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four categories:
>  # Tree-based algorithms, such as the KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, and NSG;
> where IVFFlat and HNSW are the most popular among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, while HNSW performs better in general scenarios including recommendation and personalized advertisement. *The recall ratio of IVFFlat can be gradually increased by adjusting the query parameter (nprobe), while it is hard for HNSW to improve its accuracy*. In theory, IVFFlat can achieve a 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, LUCENE-9004) for Lucene has made great progress. The issue has drawn the attention of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also very popular with many users and supporters. Compared with HNSW, IVFFlat has a smaller index size but requires k-means clustering, while HNSW is faster at query time (no training required) but requires extra storage for saving graphs [indexing 1M vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]. Another advantage is that IVFFlat can be faster and more accurate when GPU parallel computing is enabled (currently not supported in Java). Both algorithms have their merits and demerits. Since HNSW is now under development, it may be better to provide both implementations (HNSW && IVFFlat) for potential users who face very different scenarios and want more choices.
> The latest branch is [*lucene-9136-ann-ivfflat*|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org