You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "benwtrent (via GitHub)" <gi...@apache.org> on 2023/09/06 13:31:16 UTC

[GitHub] [lucene] benwtrent commented on issue #12440: Make HNSW merges faster

benwtrent commented on issue #12440:
URL: https://github.com/apache/lucene/issues/12440#issuecomment-1708370161

   > For example, if we have segment 1,2,3,4 wants to merge and form a new segment, can we just leave the HNSW graphs as-is, or we only merge the smaller ones and leave the big ones as-is. And when we do searching, we can do a greedy search (always go to the closest node, whichever graph), or we can just let user choose to use multithreading to exchange for the latency?
   
   This would be tricky. This scales linearly per graph and technically recall is higher when searching multiple graphs than a single one. It seems that the `efSearch` (lucene's `k`) per nearest neighbor search should be adjusted internally if this is the approach taken. Thinking further, it seems like a nice experiment to make regardless to see how multi-segment search speed can be improved (potentially keeping the minimally matching score or candidates between graph searches...)
   
   @zhaih In short, I think this has merit, but we could look at how it can improve multi-segment search by keeping track of min_score or dynamically adjusting `efSearch` when multiple segments are being searched.
   
   > An idea, instead of trying to merge the subgraph, is to do a union of subgraphs:
   When we merge, we build a disconnected graph which is the union of all the segment graphs.
   
   @mbrette 
   
   🤔 I am not sure about this. It seems to me that allowing the graph to be disconnected like this (pure union) would be very difficult to keep track of within the Codec. Now we not only have each node's doc_id, but which graph they belong to, etc. This would cause significant overhead and seems like it wouldn't be worth the effort.
   
   > It's worth exploring some variation of this in my opinion.
   
   I agree @mbrette, which is why I linked the paper :). I think if we do something like the nnDescent but still adhering to the diversity of neighborhood, it could help. HNSW is fairly resilient and I wouldn't expect horrific recall changes.
   
   One tricky thing there would be when do we promote/demote a node to a higher/lower level in the graph? I am not sure nodes should simply stay at their levels as graph size increases.
   
   I haven't thought about what the logic could be for determining which NSW get put on which layer. It might be as simple as re-running the probabilistic level assignment for every node. Since nodes at higher levels are also on the lower levels, we could still get a significant performance improvement as we would only have to "start from scratch" on nodes that are promoted from a lower level to a higher level (which is very rare). Node demotions could still keep their previous connections (or keep what percentage of the connections we allow in our implementation).
   
   > What is the current way to measure Lucene knn recall/performance this days ?
   @mbrette 
   
   I tried to reproduced your test from https://github.com/apache/lucene/issues/11354#issuecomment-1239961308, but was not able to (recall = 0.574).
   
   [Lucene Util](https://github.com/mikemccand/luceneutil) is the way to do this. What issues are you having? It can be sort of frustrating to keep up and running, but once you got the flow down and it built, its useful. I will happily help with this or even run what benchmarking I can my self once you have some POC work.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org