You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Jack Mazanec (Jira)" <ji...@apache.org> on 2022/08/17 18:32:00 UTC

[jira] [Commented] (LUCENE-10318) Reuse HNSW graphs when merging segments?

    [ https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580935#comment-17580935 ] 

Jack Mazanec commented on LUCENE-10318:
---------------------------------------

Hi [~julietibs]  I was thinking about something similar and would be interested in working on this. I can run some experiments to see if this would improve performance, if you haven’t already started to do so.

Additionally, I am wondering if it would make sense to extend this to support graphs that contain deleted nodes. I can think of an approach, but it is a little messy. It would follow the same idea for merging — add vectors from smaller graph into larger graph. However, before adding vectors from smaller graph, all of the deleted nodes would need to be removed from the larger graph.

In order to remove a node from the graph, I think we would need to remove it from list of neighbor arrays for each level it is in. In addition to this, because removal would break the ordinals, we would have to update all of the ordinals in the graph, which for OnHeapHNSW graph would mean updating all nodes by levels and also potentially each neighbor in each NeighborArray in the graph. 

Because removing a node could cause a number of nodes in the graph to lose a neighbor, we would need to repair the graph. To do this, I think we could create a _repair_list_ that tracks the nodes that lost a connection due to the deleted node{_}.{_} To fill the list, we would need to iterate over all of the nodes in the graph and then check if any of their _m_ connections are to the deleted node (I think this could be done when the ordinals are being updated). If so, remove the connection and then add the node to the {_}repair_list{_}.

Once the _repair_list_ is complete, for each node in the list, search the graph to get new neighbors to fill up the node’s connections to the desired amount. At this point, I would expect the time it takes to finish merging to be equal to the time it takes to insert the number of live vectors in the smaller graph plus the size of the repair list into the large graph.

All that being said, I am not sure if removing deleted nodes in the graph would be faster than just building the graph from scratch. From the logic above, we would need to at least iterate over each connection in the graph and potentially perform several list deletions. My guess is that when the repair list is small, I think it would be faster, but when it is large, probably not. I am going to try to start playing around with this idea, but please let me know what you think!

> Reuse HNSW graphs when merging segments?
> ----------------------------------------
>
>                 Key: LUCENE-10318
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10318
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Priority: Major
>
> Currently when merging segments, the HNSW vectors format rebuilds the entire graph from scratch. In general, building these graphs is very expensive, and it'd be nice to optimize it in any way we can. I was wondering if during merge, we could choose the largest segment with no deletes, and load its HNSW graph into heap. Then we'd add vectors from the other segments to this graph, through the normal build process. This could cut down on the number of operations we need to perform when building the graph.
> This is just an early idea, I haven't run experiments to see if it would help. I'd guess that whether it helps would also depend on details of the MergePolicy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org