Posted to commits@lucene.apache.org by so...@apache.org on 2021/03/18 13:39:36 UTC

[lucene] branch main updated: LUCENE-9844: document disk layout of Lucene90VectorFormat

This is an automated email from the ASF dual-hosted git repository.

sokolov pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/lucene.git


The following commit(s) were added to refs/heads/main by this push:
     new 5b36af3  LUCENE-9844: document disk layout of Lucene90VectorFormat
5b36af3 is described below

commit 5b36af3cd7978fed5fbbfb0bab5405848acc2d7b
Author: Michael Sokolov <so...@falutin.net>
AuthorDate: Thu Mar 18 09:39:23 2021 -0400

    LUCENE-9844: document disk layout of Lucene90VectorFormat
---
 .../codecs/lucene90/Lucene90VectorFormat.java      | 37 +++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorFormat.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorFormat.java
index c792b4b..53b86bf 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorFormat.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorFormat.java
@@ -25,7 +25,42 @@ import org.apache.lucene.index.SegmentReadState;
 import org.apache.lucene.index.SegmentWriteState;
 
 /**
- * Lucene 9.0 vector format, which encodes dense numeric vector values.
+ * Lucene 9.0 vector format, which encodes numeric vector values and an optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW search. The format
+ * consists of three files:
+ *
+ * <h1>.vec (vector data) file</h1>
+ *
+ * <p>This file stores all the floating-point vector data ordered by field, document ordinal, and
+ * vector dimension. The floats are stored in little-endian byte order.
+ *
+ * <h1>.vex (vector index) file</h1>
+ *
+ * <p>Stores graphs connecting the documents for each field. For each document having a vector for a
+ * given field, this is stored as:
+ *
+ * <ul>
+ *   <li><b>[int32]</b> the number of neighbor nodes
+ *   <li><b>array[vint]</b> the neighbor ordinals, delta-encoded (initially subtracting -1)
+ * </ul>
+ *
+ * <h1>.vem (vector metadata) file</h1>
+ *
+ * <p>For each field:
+ *
+ * <ul>
+ *   <li><b>[int32]</b> field number
+ *   <li><b>[int32]</b> vector search strategy ordinal
+ *   <li><b>[vlong]</b> offset to this field's vectors in the .vec file
+ *   <li><b>[vlong]</b> length of this field's vectors, in bytes
+ *   <li><b>[vlong]</b> offset to this field's index in the .vex file
+ *   <li><b>[vlong]</b> length of this field's index data, in bytes
+ *   <li><b>[int]</b> dimension of this field's vectors
+ *   <li><b>[int]</b> the number of documents having values for this field
+ *   <li><b>array[vint]</b> the docids of documents having vectors, in order
+ *   <li><b>array[vlong]</b> for each document having a vector, the offset (delta-encoded relative
+ *       to the previous document) of its entry in the .vex file
+ * </ul>
  *
  * @lucene.experimental
  */
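
The layout documented in the Javadoc above can be illustrated with a small, self-contained sketch. It is not the Lucene90VectorFormat reader. It uses only java.nio, and the class and method names (VectorLayoutSketch, readVector, decodeOffsets) are hypothetical. It assumes that a field's vectors are packed back to back in the .vec data with 4 bytes per float and no padding, and that the first delta-encoded .vex offset is relative to zero. Neither detail is stated explicitly in the commit.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class VectorLayoutSketch {

  /** Reads one vector, addressed by its ordinal, from a buffer holding .vec-style data. */
  static float[] readVector(ByteBuffer vec, long fieldOffset, int ordinal, int dimension) {
    // Assumption: vectors are contiguous, 4 bytes per float, no padding between them.
    long start = fieldOffset + (long) ordinal * dimension * Float.BYTES;
    // duplicate() does not inherit the byte order, so set little-endian explicitly.
    ByteBuffer view = vec.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    view.position(Math.toIntExact(start));
    float[] out = new float[dimension];
    for (int i = 0; i < dimension; i++) {
      out[i] = view.getFloat();
    }
    return out;
  }

  /** Converts per-document delta-encoded offsets into absolute offsets via a running sum. */
  static long[] decodeOffsets(long[] deltas) {
    long[] absolute = new long[deltas.length];
    long previous = 0; // assumption: the first delta is relative to offset 0
    for (int i = 0; i < deltas.length; i++) {
      previous += deltas[i];
      absolute[i] = previous;
    }
    return absolute;
  }

  public static void main(String[] args) {
    // Two 3-dimensional vectors written in little-endian order, as the .vec file is described.
    ByteBuffer vec = ByteBuffer.allocate(2 * 3 * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (float f : new float[] {1f, 2f, 3f, 4f, 5f, 6f}) {
      vec.putFloat(f);
    }
    System.out.println(Arrays.toString(readVector(vec, 0L, 1, 3))); // [4.0, 5.0, 6.0]
    System.out.println(Arrays.toString(decodeOffsets(new long[] {0, 5, 7}))); // [0, 5, 12]
  }
}
```

The offset and dimension passed to readVector correspond to the per-field values recorded in the .vem file. A real reader would take them from there rather than hard-coding them.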