Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/11/14 21:06:53 UTC

[GitHub] [lucene] benwtrent commented on pull request #11860: GITHUB-11830 Better optimize storage for vector connections

benwtrent commented on PR #11860:
URL: https://github.com/apache/lucene/pull/11860#issuecomment-1314379458

   I am digging back into this.
   
    - Vectors that are similar do not have similar doc_ids. Guaranteeing that they did would require some clustering step, which implies a different graph structure, since all the vectors would have to be known ahead of time.
       - In fact, giving similar nodes similar doc_ids might fix the problem of requiring ALL of the `.vex` data to be loaded in memory, as we could jump to clusters of doc_ids to read the most relevant neighbors... While a fruitful endeavor, this is complicated and out of scope for now.
    - We have seen graphs with as many as `Integer.MAX_VALUE` vectors per segment, but these are the exception.
   
   So:
    - The neighbors for a given vector have effectively random doc_ids drawn from `[0, MAX_DOC)`
    - We read ALL of a vector's neighbors at once when exploring the graph
    - We must be able to skip quickly to a specific level, and to a specific node within that level (see the sketch just below)
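   
   To make that last requirement concrete, here is a minimal sketch (a hypothetical method, not current Lucene code) of why fixed-size neighbor blocks make the skip cheap:
   
   ```java
   // Minimal sketch: with a single bits-per-value, every node's neighbor block
   // at a given level has the same byte size, so "skipping" to a node is pure
   // offset arithmetic instead of a scan. All names here are illustrative.
   static long neighborBlockOffset(long levelStartOffset, int nodeOrd, int maxConn, int bpv) {
     long bytesPerNode = (maxConn * (long) bpv + 7) / 8; // round up to whole bytes
     return levelStartOffset + (long) nodeOrd * bytesPerNode;
   }
   ```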
   
   This leads me down the road of considering @jpountz's `Option 1`: a unified bits-per-value (bpv), but with specialized bulk encoding/decoding similar to `ForUtil`.
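   
   To illustrate the idea, here is a rough sketch. The `packNeighbors`/`unpackNeighbors` helpers below are hypothetical, not the actual `ForUtil` API (which uses generated, unrolled SIMD-friendly loops), but they show the unified-bpv encoding shape:
   
   ```java
   // Pack `count` neighbor ids, each fitting in `bpv` bits (bpv < 64), into longs.
   static long[] packNeighbors(int[] neighbors, int count, int bpv) {
     long mask = (1L << bpv) - 1;
     long[] packed = new long[(count * bpv + 63) / 64];
     for (int i = 0; i < count; i++) {
       int bit = i * bpv;
       long value = neighbors[i] & mask;
       packed[bit >>> 6] |= value << (bit & 63);
       if ((bit & 63) + bpv > 64) { // value straddles a long boundary
         packed[(bit >>> 6) + 1] |= value >>> (64 - (bit & 63));
       }
     }
     return packed;
   }
   
   // Decode all `count` neighbors in one pass, mirroring the bulk read we need.
   static void unpackNeighbors(long[] packed, int[] neighbors, int count, int bpv) {
     long mask = (1L << bpv) - 1;
     for (int i = 0; i < count; i++) {
       int bit = i * bpv;
       long value = packed[bit >>> 6] >>> (bit & 63);
       if ((bit & 63) + bpv > 64) {
         value |= packed[(bit >>> 6) + 1] << (64 - (bit & 63));
       }
       neighbors[i] = (int) (value & mask);
     }
   }
   ```
   
   With a single bpv per segment (derived from `MAX_DOC`), every neighbor block also has a fixed size, which keeps the seek arithmetic from the list above trivial.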
   
   ------------------
   @rmuir's suggestion is that we use varying bpv. For that to pay off, the neighbor distributions of different nodes would have to be significantly different. I am not sure we can rely on this without some clustering mechanism guaranteeing the doc_id distributions. Let me try to explain.
   
   Let's take two disconnected nodes, `a` and `b`, and assume that each neighbor has a random doc id drawn from `[0, MAX_DOC)`. The number of neighbors is determined by `M`; let's use the default value of `16`. Label the set of `M` neighbors for `a` and `b` as `a_neighbors` and `b_neighbors`, respectively. For simplicity, let's assume `a_neighbors` contains `MAX_DOC - 1` and thus has the worst-case bpv requirement.
   
   This turns into a probabilistic problem. Given the total population of documents `[0, MAX_DOC)`, what is the probability that every neighbor in `b_neighbors` requires at least 4 fewer bits than `MAX_DOC - 1`? (I chose 4 bits because it fits nicely into our `ForUtil` optimizations; we would snap to standard bpv in increments of 4.)
   
   This can be modeled as independent random events. A doc id needs at least 4 fewer bits than `MAX_DOC - 1` only if it falls below `2^(bpv(MAX_DOC - 1) - 4)`, which is at most 1/8th of the doc id range (and as little as 1/16th when `MAX_DOC` sits right at a power of two). So a single neighbor qualifies with probability at most 1/8, and all 16 (`M`) neighbors must qualify independently, which means we get at most `Math.pow(1.0/8, 16.0)` as our probability of success: `3.552713678800501E-15`, or, expressed as a percentage, roughly a 0.0000000000004% chance of it happening. TECHNICALLY, the probability is even worse, as we are sampling without replacement: the pool of small-enough doc ids shrinks after every success, so each subsequent draw is less likely than 1/8 to qualify.
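   
   A quick sanity check of those numbers (a standalone sketch; `maxDoc` here is just an illustrative segment size, not from the discussion):
   
   ```java
   public class NeighborBpvOdds {
     public static void main(String[] args) {
       int m = 16;                    // default HNSW max connections per node
       int maxDoc = 10_000_000;       // illustrative segment size
       int worstBpv = 64 - Long.numberOfLeadingZeros(maxDoc - 1); // bits for MAX_DOC - 1
       long cutoff = 1L << (worstBpv - 4); // doc ids below this need >= 4 fewer bits
       double perNeighbor = (double) cutoff / maxDoc; // between 1/16 and 1/8
       System.out.printf("one neighbor qualifies: %.4f%n", perNeighbor);
       System.out.printf("all %d qualify: %.3e%n", m, Math.pow(perNeighbor, m));
       System.out.printf("upper bound (1/8)^%d: %.3e%n", m, Math.pow(1.0 / 8, m)); // ~3.553e-15
     }
   }
   ```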
   
   I don't think this justifies storing varying bpv, at least not until we can ensure that the collection of neighbors has some distribution across the doc_ids other than completely random.

