You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/02/27 12:09:13 UTC

[GitHub] [lucene-solr] irvingzhang opened a new pull request #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers

irvingzhang opened a new pull request #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers
URL: https://github.com/apache/lucene-solr/pull/1295
 
 
   `if (dist < f.distance() || results.size() < ef) {
      Neighbor n = new ImmutableNeighbor(e.docId(), dist);
      candidates.add(n);
      results.insertWithOverflow(n);
      f = results.top();
   }`
   
   If (dist < f.distance()) but results.size() >= ef, the "Neighbor n" would be added to "results" ("results" is a sub-type of PriorityQueue). The actual size of "results" would be between "ef" and results' max queue size, while its expected size if "ef". 
   
   Consider the following situation:
   
   `FurthestNeighbors neighbors = new FurthestNeighbors(ef, ep);
     for (int l = hnsw.topLevel(); l > 0; l--) {
       visitedCount += hnsw.searchLayer(query, neighbors, 1, l, vectorValues);
     }
     visitedCount += hnsw.searchLayer(query, neighbors, ef, 0, vectorValues);`
   
   where the max size of "neighbors" ("neighbors" is also a sub-type of PriorityQueue) is ef (assume ef > 1). When search over a non-zero layer, we are going to find the nearest one neighbor by `hnsw.searchLayer(query, neighbors, 1, l, vectorValues);`, where l is the layer and layer > 0. The actual size of "neighbors" may be larger than 1.
   
   Assume that "results.size() <= ef", I think "results.pop();" when "results.size() == ef" can solve this problem.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] irvingzhang edited a comment on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers

Posted by GitBox <gi...@apache.org>.
irvingzhang edited a comment on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers
URL: https://github.com/apache/lucene-solr/pull/1295#issuecomment-592287771
 
 
   > I believe in practice that results. max size is always set to ef, so there shouldn't be any real issue. I agree that the interface doesn't make that plain; we should enforce this invariant by API contract
   
   I agree that the max size is always set to _ef_, but _ef_ has different values in different layers. 
   According to **Algorithm 5** of [papar](https://arxiv.org/pdf/1603.09320.pdf),  HNSW searches the nearest one (namely, _ef_=1) neighbor from top layer to the 1st layer,  and then finds the nearest _ef_ (_ef_=topK) neighbors from layer 0. In the implementation of Lucene HNSW, the actual size of result queue (Line 64, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)) is set to _ef_=topK when searching from top layer to the 1st layer, result in finding more neighbors than expected. Even if the parameter _ef_ is set to 1 in Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java), the code `if (dist < f.distance() || results.size() < ef)` (Line 87, [HNSWGraph](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraph.java)) allows insert more than 1 neighbor to the "results" when `dist < f.distance()` because the max size of "results" is _ef_=topK, which implies that actual size of "results" belongs to [1, topK].
   
   The simplest way to check this problem is to print the actual size of neighbors. For example, add "System.out.println(neighbors.size());" after "visitedCount += hnsw.searchLayer(query, neighbors, 1, l, vectorValues);" (Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)), where the nearest one neighbor is expected, but the actual neighbor size would be range from 1~topK. Which also applies to [HNSWGraphWriter](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphWriter.java).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] irvingzhang edited a comment on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers

Posted by GitBox <gi...@apache.org>.
irvingzhang edited a comment on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers
URL: https://github.com/apache/lucene-solr/pull/1295#issuecomment-592287771
 
 
   > I believe in practice that results. max size is always set to ef, so there shouldn't be any real issue. I agree that the interface doesn't make that plain; we should enforce this invariant by API contract
   
   I agree that the max size is always set to _ef_, but _ef_ has different values in different layers. 
   According to **Algorithm 5** of [papar](https://arxiv.org/pdf/1603.09320.pdf),  HNSW searches the nearest one (namely, _ef_=1) neighbor from top layer to the 1st layer,  and then finds the nearest _ef_ (_ef_=topK) neighbors from layer 0. In the implementation of Lucene HNSW, the actual size of result queue (Line 64, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)) is set to _ef_=topK when searching from top layer to the 1st layer, result in finding more neighbors than expected. Even if the parameter _ef_ is set to 1 in Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java), the code `if (dist < f.distance() || results.size() < ef)` (Line 87, [HNSWGraph](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraph.java)) allows inserting more than 1 neighbor to the "results" when `dist < f.distance()` but `results.size() >= ef` (here _ef_=1, corresponding to  Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)) because the max size of "results" is topK, which implies that actual size of "results" belongs to [1, topK].
   
   The simplest way to check this problem is to print the actual size of neighbors. For example, add "System.out.println(neighbors.size());" after "visitedCount += hnsw.searchLayer(query, neighbors, 1, l, vectorValues);" (Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)), where the nearest one neighbor is expected, but the actual neighbor size would be range from 1~topK. Which also applies to [HNSWGraphWriter](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphWriter.java).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers

Posted by GitBox <gi...@apache.org>.
msokolov commented on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers
URL: https://github.com/apache/lucene-solr/pull/1295#issuecomment-593096201
 
 
   Ah, I see you're right @irvingzhang . I think we could also save something by eliminating the priority queue for this case - it's silly to use an (effectively) 1-length queue, when all we need is a variable, but this does make the implementation match the algorithm: I'll merge 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] irvingzhang edited a comment on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers

Posted by GitBox <gi...@apache.org>.
irvingzhang edited a comment on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers
URL: https://github.com/apache/lucene-solr/pull/1295#issuecomment-592287771
 
 
   > I believe in practice that results. max size is always set to ef, so there shouldn't be any real issue. I agree that the interface doesn't make that plain; we should enforce this invariant by API contract
   
   I agree that the max size is always set to _ef_, but _ef_ has different values in different layers. 
   According to **Algorithm 5** of [papar](https://arxiv.org/pdf/1603.09320.pdf),  HNSW searches the nearest one (namely, _ef_=1) neighbor from top layer to the 1st layer,  and then finds the nearest _ef_ (_ef_=topK) neighbors from layer 0. In the implementation of Lucene HNSW, the actual size of result queue (Line 63, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)) is set to _ef_=topK when searching from top layer to the 1st layer, result in finding more neighbors than expected. Even if the parameter _ef_ is set to 1 in Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java), the code `if (dist < f.distance() || results.size() < ef)` (Line 87, [HNSWGraph](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraph.java)) allows insert more than 1 neighbors to the "results" when `dist < f.distance()` because the max size of "results" is _ef_=topK.
   
   The simplest way to check this problem is to print the actual size of neighbors. For example, add "System.out.println(neighbors.size());" after "visitedCount += hnsw.searchLayer(query, neighbors, 1, l, vectorValues);" (Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)), where the nearest one neighbor is expected, but the actual neighbor size would be range from 1~topK. Which also applies to [HNSWGraphWriter](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphWriter.java).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] irvingzhang edited a comment on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers

Posted by GitBox <gi...@apache.org>.
irvingzhang edited a comment on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers
URL: https://github.com/apache/lucene-solr/pull/1295#issuecomment-592287771
 
 
   > I believe in practice that results. max size is always set to ef, so there shouldn't be any real issue. I agree that the interface doesn't make that plain; we should enforce this invariant by API contract
   
   Hi, @msokolov, I agree that the max size is always set to _ef_, but _ef_ has different values in different layers. 
   According to **Algorithm 5** of Yury's [papar](https://arxiv.org/pdf/1603.09320.pdf),  HNSW searches the nearest one neighbor (namely, _ef_=1) from the top layer to the 1st layer,  and then finds the nearest _ef_ (_ef_=topK) neighbors from layer 0. In the implementation of Lucene HNSW, the actual size of result queue (Line 64, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)) is set to _ef_=topK when searching from top layer to the 1st layer while expected neighbor size is 1, result in finding more neighbors than expected. Even if the parameter _ef_ is set to 1 in Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java), the condition `if (dist < f.distance() || results.size() < ef)` (Line 87, [HNSWGraph](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraph.java)) allows inserting more than 1 neighbor to the "results" queue when `dist < f.distance()` and `results.size() >= ef` (here _ef_=1, corresponding to  Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)) because the max size of "results" is topK, which implies that the actual size of "results" queue belongs to [1, topK].
   
   **The simplest way to verify this problem is to print the actual size of neighbors**. For example, add "System.out.println(neighbors.size());" after "visitedCount += hnsw.searchLayer(query, neighbors, 1, l, vectorValues);" (Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)), where the nearest one neighbor is expected, but the printed neighbor size would be range from 1~topK. Which also applies to [HNSWGraphWriter](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphWriter.java).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] irvingzhang commented on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers

Posted by GitBox <gi...@apache.org>.
irvingzhang commented on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers
URL: https://github.com/apache/lucene-solr/pull/1295#issuecomment-592287771
 
 
   > I believe in practice that results. max size is always set to ef, so there shouldn't be any real issue. I agree that the interface doesn't make that plain; we should enforce this invariant by API contract
   
   I agree that the max size is always set to _ef_. 
   According to **Algorithm 5** of [papar](https://arxiv.org/pdf/1603.09320.pdf),  HNSW searches the nearest one (namely, _ef_=1) neighbor from top layer to the 1st layer,  and then finds the nearest _ef_ (_ef_=topK) neighbors from layer 0. In the implementation of Lucene HNSW, the actual size of result queue (Line 63, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)) is set to _ef_=topK when searching from top layer to the 1st layer, result in finding more neighbors than expected. Even if the parameter _ef_ is set to 1 in Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java), the code `if (dist < f.distance() || results.size() < ef)` (Line 87, [HNSWGraph](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraph.java)) allows insert more than 1 neighbors to the "results" when `dist < f.distance()` because the max size of "results" is _ef_=topK.
   
   The simplest way to check this problem is to print the actual size of neighbors. For example, add "System.out.println(neighbors.size());" after "visitedCount += hnsw.searchLayer(query, neighbors, 1, l, vectorValues);" (Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)), where the nearest one neighbor is expected, but the actual neighbor size would be range from 1~topK. Which also applies to [HNSWGraphWriter](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphWriter.java).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] irvingzhang edited a comment on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers

Posted by GitBox <gi...@apache.org>.
irvingzhang edited a comment on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers
URL: https://github.com/apache/lucene-solr/pull/1295#issuecomment-592287771
 
 
   > I believe in practice that results. max size is always set to ef, so there shouldn't be any real issue. I agree that the interface doesn't make that plain; we should enforce this invariant by API contract
   
   I agree that the max size is always set to _ef_, but _ef_ has different values in different layers. 
   According to **Algorithm 5** of [papar](https://arxiv.org/pdf/1603.09320.pdf),  HNSW searches the nearest one (namely, _ef_=1) neighbor from top layer to the 1st layer,  and then finds the nearest _ef_ (_ef_=topK) neighbors from layer 0. In the implementation of Lucene HNSW, the actual size of result queue (Line 63, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)) is set to _ef_=topK when searching from top layer to the 1st layer, result in finding more neighbors than expected. Even if the parameter _ef_ is set to 1 in Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java), the code `if (dist < f.distance() || results.size() < ef)` (Line 87, [HNSWGraph](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraph.java)) allows insert more than 1 neighbor to the "results" when `dist < f.distance()` because the max size of "results" is _ef_=topK, which implies that actual size of "results" belongs to [1, topK].
   
   The simplest way to check this problem is to print the actual size of neighbors. For example, add "System.out.println(neighbors.size());" after "visitedCount += hnsw.searchLayer(query, neighbors, 1, l, vectorValues);" (Line 66, [HNSWGraphReader](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java)), where the nearest one neighbor is expected, but the actual neighbor size would be range from 1~topK. Which also applies to [HNSWGraphWriter](https://github.com/apache/lucene-solr/blob/jira/lucene-9004-aknn-2/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphWriter.java).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov merged pull request #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers

Posted by GitBox <gi...@apache.org>.
msokolov merged pull request #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers
URL: https://github.com/apache/lucene-solr/pull/1295
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers

Posted by GitBox <gi...@apache.org>.
msokolov commented on issue #1295: Lucene-9004: bug fix for searching the nearest one neighbor in higher layers
URL: https://github.com/apache/lucene-solr/pull/1295#issuecomment-591970917
 
 
   I believe in practice that results. max size is always set to ef, so there shouldn't be any real issue. I agree that the interface doesn't make that plain; we should enforce this invariant by API contract

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org