You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/03/01 10:25:12 UTC

[GitHub] [lucene-solr] irvingzhang edited a comment on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers

irvingzhang edited a comment on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers
URL: https://github.com/apache/lucene-solr/pull/1295#issuecomment-592287771
 
 
   > I believe in practice that results. max size is always set to ef, so there shouldn't be any real issue. I agree that the interface doesn't make that plain; we should enforce this invariant by API contract
   
   Hi, @msokolov, I agree that the max size is always set to _ef_, but _ef_ has different values in different layers. 
   According to **Algorithm 5** of Yury's [papar](https://arxiv.org/pdf/1603.09320.pdf),  HNSW searches the nearest one neighbor (namely, _ef_=1) from the top layer to the 1st layer,  and then finds the nearest _ef_ (_ef_=topK) neighbors from layer 0. In the implementation of Lucene HNSW, the actual size of result queue (Line 64, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)) is set to _ef_=topK when searching from top layer to the 1st layer while expected neighbor size is 1, result in finding more neighbors than expected. Even if the parameter _ef_ is set to 1 in Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java), the condition `if (dist < f.distance() || results.size() < ef)` (Line 87, [HNSWGraph](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraph.java)) allows inserting more than 1 neighbor to the "results" queue when `dist < f.distance()` and `results.size() >= ef` (here _ef_=1, corresponding to  Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)) because the max size of "results" is topK, which implies that the actual size of "results" queue belongs to [1, topK].
   
   **The simplest way to verify this problem is to print the actual size of neighbors**. For example, add "System.out.println(neighbors.size());" after "visitedCount += hnsw.searchLayer(query, neighbors, 1, l, vectorValues);" (Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)), where the nearest one neighbor is expected, but the printed neighbor size would be range from 1~topK. Which also applies to [HNSWGraphWriter](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphWriter.java).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org