You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/02/03 05:20:26 UTC

[GitHub] [lucene] jtibshirani opened a new pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

jtibshirani opened a new pull request #641:
URL: https://github.com/apache/lucene/pull/641


   A couple of the data structures used in HNSW search are pretty large and
   expensive to allocate. This commit creates a shared candidates queue and visited
   set that are reused across calls to HnswGraph#searchLevel. Now the same data
   structures are used for building the entire graph, which can really cut down on
   allocations during indexing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #641:
URL: https://github.com/apache/lucene/pull/641#discussion_r798223589



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -93,6 +94,29 @@
     }
   }
 
+  /**
+   * Holds data structures that are used by each {@link #searchLevel} call. These can be expensive
+   * to allocate, so whenever possible they're cleared and reused across calls.
+   */
+  static class HnswSearchState {
+    final NeighborQueue candidates;

Review comment:
       I didn't include the `results` queue because it's typically a lot smaller and I thought it made the logic less clear (results are returned but also passed in as scratch state?)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on pull request #641:
URL: https://github.com/apache/lucene/pull/641#issuecomment-1029566793


   I filed https://issues.apache.org/jira/browse/LUCENE-10404 so we don't forget about the hash set idea.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

msokolov commented on pull request #641:
URL: https://github.com/apache/lucene/pull/641#issuecomment-1030105715


   re: sharing across searches I agree it gets tricky, but can be bounded RAM if we use some kind of LRU cache as a pool, and in this case the only "compatibility" check is the size of the bitset. But yeah, definitely more complex, we should only do it if there is a good speedup


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [GitHub] [lucene] msokolov commented on pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by Michael Sokolov <ms...@gmail.com>.

Ah nvm I saw your later comment to the effect that we have a copy we can
use already

On Fri, Feb 4, 2022, 8:57 AM GitBox <gi...@apache.org> wrote:

>
> msokolov commented on pull request #641:
> URL: https://github.com/apache/lucene/pull/641#issuecomment-1030007403
>
>
>    I think we cannot use  intinthashset in core since it's an external
>    dependency? Did it work for you though?
>
>    On Thu, Feb 3, 2022, 1:36 PM Julie Tibshirani ***@***.***>
>    wrote:
>
>    > Extracting HnswGraphSearcher is a lot nicer, I pushed that refactor.
>    >
>    > On the topic of hash sets, I tried switching to IntIntHashMap on top
> of
>    > this PR and it gives a nice indexing speed-up. I can open a follow-up
> PR.
>    >
>    > About reusing data structures across searches -- I've heard that some
>    > other HNSW implementations found this to be beneficial, specifically
>    > maintaining a shared pool of "visited" sets for searches to use. It
> could
>    > indeed be complex though, I'd be curious to understand the the
> magnitude of
>    > the improvement.
>    >
>    > —
>    > Reply to this email directly, view it on GitHub
>    > <https://github.com/apache/lucene/pull/641#issuecomment-1029284791>,
> or
>    > unsubscribe
>    > <
> https://github.com/notifications/unsubscribe-auth/AAHHUQJGMUPVH4ZEGND6IB3UZLDKNANCNFSM5NN57KCQ
> >
>    > .
>    > Triage notifications on the go with GitHub Mobile for iOS
>    > <
> https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675
> >
>    > or Android
>    > <
> https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub
> >.
>    >
>    > You are receiving this because you commented.Message ID:
>    > ***@***.***>
>    >
>
>
>
> --
> This is an automated message from the Apache Git Service.
> To respond to the message, please log on to GitHub and use the
> URL above to go to the specific comment.
>
> To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
>
> For queries about this service, please contact Infrastructure at:
> users@infra.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
> For additional commands, e-mail: issues-help@lucene.apache.org
>
>

[GitHub] [lucene] msokolov commented on pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

msokolov commented on pull request #641:
URL: https://github.com/apache/lucene/pull/641#issuecomment-1030007403


   I think we cannot use  intinthashset in core since it's an external
   dependency? Did it work for you though?
   
   On Thu, Feb 3, 2022, 1:36 PM Julie Tibshirani ***@***.***>
   wrote:
   
   > Extracting HnswGraphSearcher is a lot nicer, I pushed that refactor.
   >
   > On the topic of hash sets, I tried switching to IntIntHashMap on top of
   > this PR and it gives a nice indexing speed-up. I can open a follow-up PR.
   >
   > About reusing data structures across searches -- I've heard that some
   > other HNSW implementations found this to be beneficial, specifically
   > maintaining a shared pool of "visited" sets for searches to use. It could
   > indeed be complex though, I'd be curious to understand the the magnitude of
   > the improvement.
   >
   > —
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/lucene/pull/641#issuecomment-1029284791>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAHHUQJGMUPVH4ZEGND6IB3UZLDKNANCNFSM5NN57KCQ>
   > .
   > Triage notifications on the go with GitHub Mobile for iOS
   > <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
   > or Android
   > <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
   >
   > You are receiving this because you commented.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on pull request #641:
URL: https://github.com/apache/lucene/pull/641#issuecomment-1030190957


   > I think we cannot use  intinthashset in core 
   
   I used `IntIntHashMap` as a quick test, which was recently copied in from hppc to `org.apache.lucene.util.hppc`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #641:
URL: https://github.com/apache/lucene/pull/641#discussion_r799027275



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
##########
@@ -93,6 +96,11 @@ public HnswGraphBuilder(
     this.random = new SplittableRandom(seed);
     int levelOfFirstNode = getRandomGraphLevel(ml, random);
     this.hnsw = new HnswGraph(maxConn, levelOfFirstNode);
+    this.graphSearcher =
+        new HnswGraphSearcher(
+            similarityFunction,
+            new NeighborQueue(beamWidth, similarityFunction.reversed == false),
+            new FixedBitSet(vectorValues.size()));

Review comment:
       @jtibshirani I am wondering what is the reasoning of using `FixedBitSet` during graph construction?   Are we expecting to visit most of the graph nodes ( I guess that's true on a smaller graphs)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on pull request #641:
URL: https://github.com/apache/lucene/pull/641#issuecomment-1028617830


   This PR contains two changes:
   * Reuse `visited` bit set across calls. For graph construction, switch to `FixedBitSet` instead of `SparseBitSet`.
   * Reuse `candidates` queue across calls.
   
   They both seem to have a small positive impact on indexing. Here's an example using `KnnGraphTester` with `glove-100-angular`:
   
   **Baseline:** 788966 msec to write vectors
   **Only reuse visited set:** 768152 msec to write vectors
   **PR (reuse both visited and candidates):** 760896 msec to write vectors
   
   The PR also shows a small search improvement:
   
   **Baseline** 
   ```
   Recall    QPS
   0.741     2515.710
   0.814     1589.089
   0.924      459.971
   ```
   
   **PR (reuse both visited and candidates)**
   ```
   Recall    QPS
   0.740     2569.156
   0.813     1626.779
   0.925      468.207
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani edited a comment on pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

jtibshirani edited a comment on pull request #641:
URL: https://github.com/apache/lucene/pull/641#issuecomment-1029284791


   Extracting `HnswGraphSearcher` is a lot nicer, I will push a refactor.
   
   On the topic of hash sets, I tried switching to `IntIntHashMap` on top of this PR and it gives a nice indexing speed-up. I can open a follow-up PR.
   
   About reusing data structures across searches -- I've heard that some other HNSW implementations found this to be beneficial, specifically maintaining a shared pool of "visited" sets for searches to use. It could indeed be complex though, I'd be curious to understand the the magnitude of the improvement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

msokolov commented on pull request #641:
URL: https://github.com/apache/lucene/pull/641#issuecomment-1029006052


   Thanks for working on this, and it shows a nice result. I had been discussing with some folks at work and we were considering whether it would be possible to maintain the state even longer on the search side by pooling such states in the codec. Maybe we can try in a follow-up issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

jpountz commented on pull request #641:
URL: https://github.com/apache/lucene/pull/641#issuecomment-1029048199


   I've become a bit cautious with codec-level caching. E.g. we have threadlocals for stored fields which end up storing `num_search_threads * num_indices * num_segments_per_index` states overall, which is not always negligible depending on the size of the state. It's possible to try to share across fields or segments, but you need to pay attention to the case when they might use different file formats, which introduces complexity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani merged pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

jtibshirani merged pull request #641:
URL: https://github.com/apache/lucene/pull/641


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #641:
URL: https://github.com/apache/lucene/pull/641#discussion_r799057820



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
##########
@@ -93,6 +96,11 @@ public HnswGraphBuilder(
     this.random = new SplittableRandom(seed);
     int levelOfFirstNode = getRandomGraphLevel(ml, random);
     this.hnsw = new HnswGraph(maxConn, levelOfFirstNode);
+    this.graphSearcher =
+        new HnswGraphSearcher(
+            similarityFunction,
+            new NeighborQueue(beamWidth, similarityFunction.reversed == false),
+            new FixedBitSet(vectorValues.size()));

Review comment:
       I used a `FixedBitSet` here because the insertion and access operations are faster. Also, using `SparseFixedBitSet` here doesn't actually really save memory allocations, since during the graph construction we're expected to visit each node at least once.
   
   I tried an experiment where we shared the `visited` set but kept `SparseFixedBitSet` and it didn't help performance. I also tried switching to `FixedBitSet` during search and it hurt performance, probably because we are allocating a very large array once per search.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on pull request #641: LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on pull request #641:
URL: https://github.com/apache/lucene/pull/641#issuecomment-1029284791


   Extracting `HnswGraphSearcher` is a lot nicer, I pushed that refactor.
   
   On the topic of hash sets, I tried switching to `IntIntHashMap` on top of this PR and it gives a nice indexing speed-up. I can open a follow-up PR.
   
   About reusing data structures across searches -- I've heard that some other HNSW implementations found this to be beneficial, specifically maintaining a shared pool of "visited" sets for searches to use. It could indeed be complex though, I'd be curious to understand the the magnitude of the improvement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org