You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/05/06 03:51:34 UTC

[GitHub] [lucene] LuXugang opened a new pull request, #870: LUCENE-10502: Refactor hnswVectors format

LuXugang opened a new pull request, #870:
URL: https://github.com/apache/lucene/pull/870

   follow-up of https://github.com/apache/lucene/pull/792


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [GitHub] [lucene] msokolov commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by Michael Sokolov <ms...@gmail.com>.
Okay sorry I was confused about these override methods - they are
different because of the different access patterns in the sparse/dense
cases. Maybe the loss of history was unavoidable since we
moved/renamed the file, but I wish we could maintain it.

On Fri, May 13, 2022 at 1:45 PM GitBox <gi...@apache.org> wrote:
>
>
> msokolov commented on PR #870:
> URL: https://github.com/apache/lucene/pull/870#issuecomment-1126294216
>
>    Things have been moving kind of fast here! Which is great, but I am trying
>    to catch up and having trouble reconstructing the changes. Today on main
>    lucene92/OffHeapVectorValues.java has only one commit in its git history,
>    and I'm trying to find the place where we added the overrides of
>    vectorValue() and binaryValue() for the Sparse/Dense subclasses (since they
>    are copies, it seems weird). I think this has something to do with working
>    around JVM weirdness - I have a vague memory of a discussion about that,
>    but I can't find any record of it in git. I tried looking at the old (90 /
>    91) readers but I think these changes came after that. I wonder if we lost
>    the history while doing some git surgery on this feature branch?
>
>    On Tue, May 10, 2022 at 3:17 PM Lu Xugang ***@***.***> wrote:
>
>    > Thanks @mayya-sharipova <https://github.com/mayya-sharipova> , let's move
>    > to #877 <https://github.com/apache/lucene/pull/877> to continue this
>    > change.
>    >
>    > —
>    > Reply to this email directly, view it on GitHub
>    > <https://github.com/apache/lucene/pull/870#issuecomment-1122770417>, or
>    > unsubscribe
>    > <https://github.com/notifications/unsubscribe-auth/AAHHUQP7IQ2XWDGO4TVP773VJKY25ANCNFSM5VG5OC7A>
>    > .
>    > You are receiving this because you were mentioned.Message ID:
>    > ***@***.***>
>    >
>
>
>
> --
> This is an automated message from the Apache Git Service.
> To respond to the message, please log on to GitHub and use the
> URL above to go to the specific comment.
>
> To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
>
> For queries about this service, please contact Infrastructure at:
> users@infra.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
> For additional commands, e-mail: issues-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] [lucene] LuXugang commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
LuXugang commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r867034389


##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91HnswVectorsFormat.java:
##########
@@ -82,7 +82,7 @@
  *
  * @lucene.experimental
  */
-public final class Lucene91HnswVectorsFormat extends KnnVectorsFormat {
+public class Lucene91HnswVectorsFormat extends KnnVectorsFormat {

Review Comment:
   Same reason as above



##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91HnswVectorsFormat.java:
##########
@@ -101,25 +101,25 @@ public final class Lucene91HnswVectorsFormat extends KnnVectorsFormat {
    */
   public static final int DEFAULT_BEAM_WIDTH = 100;
 
-  static final int DIRECT_MONOTONIC_BLOCK_SHIFT = 16;
-
   /**
    * Controls how many of the nearest neighbor candidates are connected to the new node. Defaults to
    * {@link Lucene91HnswVectorsFormat#DEFAULT_MAX_CONN}. See {@link HnswGraph} for more details.
    */
-  private final int maxConn;
+  final int maxConn;

Review Comment:
   Same reason as above



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova merged pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
mayya-sharipova merged PR #870:
URL: https://github.com/apache/lucene/pull/870


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r868790940


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.2 vector format, which encodes numeric vector values and an optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW search. The format
+ * consists of three files:
+ *
+ * <h2>.vec (vector data) file</h2>

Review Comment:
   Sorry to make so many small notes, but why is OrdToDoc in its own sublist instead of the top-level list? Also the note "only in sparse case" applies to both DocIds and OrdToDoc right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #870:
URL: https://github.com/apache/lucene/pull/870#issuecomment-1122736714

   @msokolov In the interest of time, I will be merging this PR to the feature branch `vectors-disi-direct`. There will an opportunity to provide more feedback once @LuXugang opens a new PR  to merge  `vectors-disi-direct` feature branch to `main`. I hope you don't mind. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
msokolov commented on PR #870:
URL: https://github.com/apache/lucene/pull/870#issuecomment-1127885286

   Thanks Mayya, I love the rapid iterations here. It's kind of funny to me
   that we have moved to use IndexedDISI to track docids here since it
   hearkens back to the first prototype I had posted, which used
   SortedNumericDocValues/BinaryDocValues to store the graph and vectors
   respectively.
   
   On Mon, May 16, 2022 at 10:14 AM Mayya Sharipova ***@***.***>
   wrote:
   
   > @msokolov <https://github.com/msokolov> Sorry that we moved fast without
   > letting you to study changes properly; we were trying to catch up the
   > upcoming 9.2 release.
   >
   > About standalone OffHeapVectorValues.java they were introduced in this PR
   > <https://github.com/apache/lucene/pull/792/files>. But the older version
   > of OffHeapVectorValues is still kept in Lucene91HnswVectorsReader
   > <https://github.com/apache/lucene/blob/main/lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91HnswVectorsReader.java#L398>,
   > and its history can be traced back.
   >
   > —
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/lucene/pull/870#issuecomment-1127730405>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAHHUQKSCSMC7AWWGZOUMODVKJJ45ANCNFSM5VG5OC7A>
   > .
   > You are receiving this because you were mentioned.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] LuXugang commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
LuXugang commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r868760034


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.2 vector format, which encodes numeric vector values and an optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW search. The format
+ * consists of three files:
+ *
+ * <h2>.vec (vector data) file</h2>

Review Comment:
   @jtibshirani  addressed in https://github.com/apache/lucene/pull/870/commits/d1a26e55a6f65277e875e77fd096aa962988ef49



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] LuXugang commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
LuXugang commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r868794312


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.2 vector format, which encodes numeric vector values and an optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW search. The format
+ * consists of three files:
+ *
+ * <h2>.vec (vector data) file</h2>

Review Comment:
    Thanks @jtibshirani ,you are right, I like these spotless things, addressed in https://github.com/apache/lucene/pull/870/commits/e3ee29f24541d9b7e9b947ca183d665583200dfb 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] LuXugang commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
LuXugang commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r867032344


##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91Codec.java:
##########
@@ -164,7 +161,7 @@ public final PointsFormat pointsFormat() {
   }
 
   @Override
-  public final KnnVectorsFormat knnVectorsFormat() {
+  public KnnVectorsFormat knnVectorsFormat() {

Review Comment:
   Hi @msokolov , similar to any other old codecs in backward-codes which can't be used for writing and we still want to use`Lucene91Codec` in unit tests(see TestLucene91HnswVectorsFormat.java in RP). so we add a new codec named `Lucene91RWCodec` extended from `Lucene91Codec`  and create a new instance of `KnnVectorsFormat` in it which could do the write things, so we had to this method not final.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] LuXugang commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
LuXugang commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r867032344


##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91Codec.java:
##########
@@ -164,7 +161,7 @@ public final PointsFormat pointsFormat() {
   }
 
   @Override
-  public final KnnVectorsFormat knnVectorsFormat() {
+  public KnnVectorsFormat knnVectorsFormat() {

Review Comment:
   Hi @msokolov , similar to any other old codecs in backward-codes which can't be used for writing and we still want to use`Lucene91Codec` in unit tests(see TestLucene91HnswVectorsFormat.java in RP). so we add a new codec named `Lucene91RWCodec` extended from `Lucene91Codec`  and create a new instance of `KnnVectorsFormat` in it which could do the write things, so we had to make this method not final.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
msokolov commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r866972665


##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91Codec.java:
##########
@@ -164,7 +161,7 @@ public final PointsFormat pointsFormat() {
   }
 
   @Override
-  public final KnnVectorsFormat knnVectorsFormat() {
+  public KnnVectorsFormat knnVectorsFormat() {

Review Comment:
   is it necessary to make these no longer final? Are we extending? Why?



##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91HnswVectorsFormat.java:
##########
@@ -82,7 +82,7 @@
  *
  * @lucene.experimental
  */
-public final class Lucene91HnswVectorsFormat extends KnnVectorsFormat {
+public class Lucene91HnswVectorsFormat extends KnnVectorsFormat {

Review Comment:
   ditto - why not final?



##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91HnswVectorsFormat.java:
##########
@@ -101,25 +101,25 @@ public final class Lucene91HnswVectorsFormat extends KnnVectorsFormat {
    */
   public static final int DEFAULT_BEAM_WIDTH = 100;
 
-  static final int DIRECT_MONOTONIC_BLOCK_SHIFT = 16;
-
   /**
    * Controls how many of the nearest neighbor candidates are connected to the new node. Defaults to
    * {@link Lucene91HnswVectorsFormat#DEFAULT_MAX_CONN}. See {@link HnswGraph} for more details.
    */
-  private final int maxConn;
+  final int maxConn;

Review Comment:
   more similar issues -- this was private, now package private - do we need to expose this for some reason?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] LuXugang commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
LuXugang commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r867032344


##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91Codec.java:
##########
@@ -164,7 +161,7 @@ public final PointsFormat pointsFormat() {
   }
 
   @Override
-  public final KnnVectorsFormat knnVectorsFormat() {
+  public KnnVectorsFormat knnVectorsFormat() {

Review Comment:
   Hi @msokolov , similar to any other old codecs in backward-codes which can't be used for writing and we still want to use`Lucene91Codec` in unit tests(see TestLucene91HnswVectorsFormat.java in RP). so we create a new codec named `Lucene91RWCodec` extended from `Lucene91Codec`  and create a new instance of `KnnVectorsFormat` which could do the write things, so we had to this method not final.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #870:
URL: https://github.com/apache/lucene/pull/870#issuecomment-1121489041

   @msokolov Thanks for your feedback on this PR. I am wondering if you have any further feedback for this work. It would be nice to get it merged for 9.2 Lucene release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] LuXugang commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
LuXugang commented on PR #870:
URL: https://github.com/apache/lucene/pull/870#issuecomment-1119767221

   oh~, it seems IDEA's refator did not work on this. addressed in https://github.com/apache/lucene/pull/870/commits/3a7a61a0bdeaa2ce3ed019f585f8333022fbab6a .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] LuXugang commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
LuXugang commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r867296165


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.2 vector format, which encodes numeric vector values and an optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW search. The format
+ * consists of three files:
+ *
+ * <h2>.vec (vector data) file</h2>

Review Comment:
   information added in https://github.com/apache/lucene/pull/870/commits/7a87469bf2af6ef4919ea762b1c728ff31bfb089.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #870:
URL: https://github.com/apache/lucene/pull/870#issuecomment-1122737290

   @LuXugang Please feel free to open a new PR to merge vectors-disi-direct feature branch to the main branch when you have time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r867171666


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.2 vector format, which encodes numeric vector values and an optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW search. The format
+ * consists of three files:
+ *
+ * <h2>.vec (vector data) file</h2>

Review Comment:
   Should we add information about the new data from `IndexedDISI#writeBitSet` and `DirectMonotonicWriter`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r868790940


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.2 vector format, which encodes numeric vector values and an optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW search. The format
+ * consists of three files:
+ *
+ * <h2>.vec (vector data) file</h2>

Review Comment:
   Sorry to make so many small notes, but why is OrdToDoc in its own sublist instead of the top-level list? Also the note "only in sparse case" applies to both DocIds and OrdToDoc right? The same comments apply to the javadoc about `.vem` below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #870:
URL: https://github.com/apache/lucene/pull/870#discussion_r868240961


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene92/Lucene92HnswVectorsFormat.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.2 vector format, which encodes numeric vector values and an optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW search. The format
+ * consists of three files:
+ *
+ * <h2>.vec (vector data) file</h2>

Review Comment:
   Thanks! Could you please list them in order like we do for the `.vex` file below? I think it makes it more precise/ easier to read.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] LuXugang commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
LuXugang commented on PR #870:
URL: https://github.com/apache/lucene/pull/870#issuecomment-1122770417

   Thanks @mayya-sharipova , let's move to https://github.com/apache/lucene/pull/877 to continue this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
msokolov commented on PR #870:
URL: https://github.com/apache/lucene/pull/870#issuecomment-1126294216

   Things have been moving kind of fast here! Which is great, but I am trying
   to catch up and having trouble reconstructing the changes. Today on main
   lucene92/OffHeapVectorValues.java has only one commit in its git history,
   and I'm trying to find the place where we added the overrides of
   vectorValue() and binaryValue() for the Sparse/Dense subclasses (since they
   are copies, it seems weird). I think this has something to do with working
   around JVM weirdness - I have a vague memory of a discussion about that,
   but I can't find any record of it in git. I tried looking at the old (90 /
   91) readers but I think these changes came after that. I wonder if we lost
   the history while doing some git surgery on this feature branch?
   
   On Tue, May 10, 2022 at 3:17 PM Lu Xugang ***@***.***> wrote:
   
   > Thanks @mayya-sharipova <https://github.com/mayya-sharipova> , let's move
   > to #877 <https://github.com/apache/lucene/pull/877> to continue this
   > change.
   >
   > —
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/lucene/pull/870#issuecomment-1122770417>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAHHUQP7IQ2XWDGO4TVP773VJKY25ANCNFSM5VG5OC7A>
   > .
   > You are receiving this because you were mentioned.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #870:
URL: https://github.com/apache/lucene/pull/870#issuecomment-1127730405

   @msokolov Sorry that we moved fast without letting you to study changes properly; we were trying to catch up the upcoming 9.2 release.
   
   About  standalone `OffHeapVectorValues.java` they were introduced in this [PR](https://github.com/apache/lucene/pull/792/files).  But the older version of `OffHeapVectorValues` is still kept in [Lucene91HnswVectorsReader](https://github.com/apache/lucene/blob/main/lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91HnswVectorsReader.java#L398), and its history can be traced back.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org