Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/05/25 04:15:19 UTC

[GitHub] [lucene] LuXugang commented on pull request #728: LUCENE-10194 Buffer KNN vectors on disk

LuXugang commented on PR #728:
URL: https://github.com/apache/lucene/pull/728#issuecomment-1136706160

   The core problem seems to be how to avoid loading all vector values of all fields into memory during indexing. IIUC, as @rmuir said, we could stream vectors to the codec API directly. A rough draft of the `.vec` format might look like this:
   <img width="839" alt="image" src="https://user-images.githubusercontent.com/6985548/170176157-76bf2506-6c4b-480f-8191-919443077b15.png">
   
   
   This is similar to how `.fdx` writes stored values on the fly. After the `.vec` file is closed, we read it back and build the HNSW graph.
   
   We could locate the part of one field's vector values inside a `chunk` by node and doc, but that would surely be a bit slower than storing all of one field's vector values in one contiguous interval (where a vector value can be randomly accessed by ord (node) and dimension).
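   
   To make the contiguous case concrete, here is a minimal sketch of the offset arithmetic, assuming dense float32 vectors of a single field stored back to back. The class `FlatVectorLayout` and its methods are hypothetical names for illustration only, not Lucene's actual reader:

   ```java
   import java.nio.ByteBuffer;

   /** Hypothetical illustration; not the real Lucene .vec reader. */
   final class FlatVectorLayout {

       /** Byte offset of vector `ord` when one field's vectors are stored back to back. */
       static long offsetOf(int ord, int dimension) {
           return (long) ord * dimension * Float.BYTES;
       }

       /** Reads vector `ord` from a buffer holding only this field's float32 vectors. */
       static float[] read(ByteBuffer data, int ord, int dimension) {
           float[] vector = new float[dimension];
           int base = Math.toIntExact(offsetOf(ord, dimension));
           for (int i = 0; i < dimension; i++) {
               // Absolute get: does not move the buffer position.
               vector[i] = data.getFloat(base + i * Float.BYTES);
           }
           return vector;
       }
   }
   ```

   With a chunked layout, the same lookup would first have to find the chunk that holds the doc and then the field's slice inside that chunk, which is where the extra cost comes from.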
   
   > If a user had 100 vector fields, then now we might have 100+ files being written concurrently, multiplied by the number of segments we're writing at the same time. It seems like this could cause problems 
   
   @jtibshirani, or should we still write all values of all fields to a single temp file, as in the picture above, and then, when a flush is triggered, read this temp file back and write it out in the Lucene92 codec format?
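   
   A rough sketch of the single-temp-file idea, using only the JDK and assuming a simple record layout of (field number, doc id, dimension, float32 values). The class `TempVectorBuffer` and this record layout are assumptions made for illustration; they are not the format in the picture and not an existing Lucene API:

   ```java
   import java.io.BufferedInputStream;
   import java.io.BufferedOutputStream;
   import java.io.DataInputStream;
   import java.io.DataOutputStream;
   import java.io.EOFException;
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.util.ArrayList;
   import java.util.List;

   /** Hypothetical illustration: all fields share one temp file during indexing. */
   final class TempVectorBuffer implements AutoCloseable {
       private final DataOutputStream out;

       TempVectorBuffer(Path tempFile) throws IOException {
           out = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(tempFile)));
       }

       /** Appends one vector as soon as it is indexed, so it is not held on heap. */
       void add(int fieldNumber, int docId, float[] vector) throws IOException {
           out.writeInt(fieldNumber);
           out.writeInt(docId);
           out.writeInt(vector.length);
           for (float v : vector) {
               out.writeFloat(v);
           }
       }

       @Override
       public void close() throws IOException {
           out.close();
       }

       /** At flush: scans the whole file but keeps only one field's vectors, in doc order. */
       static List<float[]> readField(Path tempFile, int wantedField) throws IOException {
           List<float[]> vectors = new ArrayList<>();
           try (DataInputStream in =
                   new DataInputStream(new BufferedInputStream(Files.newInputStream(tempFile)))) {
               while (true) {
                   int fieldNumber;
                   try {
                       fieldNumber = in.readInt();
                   } catch (EOFException endOfFile) {
                       break; // no more records
                   }
                   in.readInt(); // doc id, not needed in this sketch
                   float[] vector = new float[in.readInt()];
                   for (int i = 0; i < vector.length; i++) {
                       vector[i] = in.readFloat();
                   }
                   if (fieldNumber == wantedField) {
                       vectors.add(vector);
                   }
               }
           }
           return vectors;
       }
   }
   ```

   This keeps the number of open files at one per segment being written, at the cost of one sequential pass over the temp file per field at flush time.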


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
