You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/11/22 12:44:30 UTC

[GitHub] [lucene] jpountz opened a new issue, #11963: Improve vector quantization API

jpountz opened a new issue, #11963:
URL: https://github.com/apache/lucene/issues/11963

   ### Description
   
   Follow-up of https://github.com/apache/lucene/pull/11860#discussion_r1027106953: the API for quantization of vectors is a bit surprising at times in that you can index bytes but then still get float[] arrays whose values are bytes at search time. Can we make sure it's either bytes all the way, or floats all the way?
   
   I'm not entirely sure how best to move it forward. One option would be to have a disjoint set of APIs for the binary encoding, like for doc values: different `Field` class, different `VectorValues` class, different `searchNearestNeighbors` method. I wonder if another option would be to make it floats all the way all the time, including for the byte encoding, ie. the `Field` and `VectorValues` class would still take and expose `float`s but `VectorValues#binaryValue` would be removed and the `Field` constructor would barf if the float values do not represent exact bytes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] javanna commented on issue #11963: Improve vector quantization API

Posted by "javanna (via GitHub)" <gi...@apache.org>.
javanna commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1402141048

   #12105 and #12107 should complete the improvements that the description of this issue mentioned. I am leaning towards closing this if there are no objections.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] javanna commented on issue #11963: Improve vector quantization API

Posted by "javanna (via GitHub)" <gi...@apache.org>.
javanna commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1401129828

   It looks like we made all the suggested changes. One step that is missing is renaming the float classes and methods to follow exactly the naming convention suggested in the last comment. I opened #12105 for discussion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] benwtrent commented on issue #11963: Improve vector quantization API

Posted by GitBox <gi...@apache.org>.
benwtrent commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1351900658

   OK, step one of this is done with the `KnnByteVectorQuery`. I am next approaching a new `KnnByteVectorField` and this will cause some refactoring to `LeafReader` or `VectorValues`.
   
   Right now, `VectorValues` assumes all vector values are `float[]` and "expands" the bytes, needlessly.
   
   There are many ways to approach this and I am honestly not sure the way to go here (being unfamiliar with typical API patterns).
   
   It seems there are a handful of seemingly valid options:
   
    * 1. Add a method to `VectorValues` (which is returned via `LeafReader`) that explicitly returns `byteVectorValue()` and will throw if `vectorValue()` is called when the encoding is `BYTE` (and throw on `byteVectorValue()` if encoding is `FLOAT32`).
       * This doesn't seem like the best to me. Though we already require users to choose the correct methods based on the vector encoding.
    * 2. Add a new `AbstractVectorValues<T>` where `vectorValue()` returns `T`. `LeafReader` would be changed to return that instead of `VectorValues`. 
       * There don't seem to be precedent on using generics like this anywhere, and this breaks with how we handle DocValues. The main difference with vectors and doc values is that numeric doc values are all returned as Long, but really could be any number whose bytes FIT in a long...
    * 3. Add a new `LeafReader#getByteVectorValues()` that returns a `ByteVectorValues` object. The main downside here is that we are not making it flexible for adding new vector kinds in the future. We will want to add boolean vectors & hamming distance in the future. These are important for image search.
    * 4. Simply make `VectorValues` always return `BytesRef` for everything and provide similarity that knows the encoding and can decode the `float[]` if that is the encoding. This feels closer to what we do with doc values (taking `DoubleDocValuesField` & `DoubleValuesSource` as prior art with how that interaction is done with `NumericDocValues`).
   
   
   @jpountz @rmuir do y'all have opinions here? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on issue #11963: Improve vector quantization API

Posted by GitBox <gi...@apache.org>.
rmuir commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1351946133

   I would tend to expect some organization like this:
   
   indexing:
   * ByteVectorField
   * FloatVectorField
   
   searching:
   * ByteVectorField.newQuery
   * FloatVectorField.newQuery
   
   codec api:
   * KnnVectorsReader.getByteVectors(String field)
   * KnnVectorsReader.getFloatVectors(String field)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] benwtrent commented on issue #11963: Improve vector quantization API

Posted by GitBox <gi...@apache.org>.
benwtrent commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1323680557

   This is a naive question, but is there a design principle I am missing that makes `VectorValues<T>` out of line?
   
   I don't know how often folks will read the vector values directly instead of just searching them. But, transforming everything to float has a performance impact during ingest. Additionally, if the user is expecting bytes, they may have to transform their floats again when reading values, which is another performance overhead.
   
   So, I am against 'floats everywhere', but I realize that other things in Lucene have unified types. 
   
   I would prefer 'bytes everywhere' instead if we were pushing for a single array kind.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on issue #11963: Improve vector quantization API

Posted by GitBox <gi...@apache.org>.
jpountz commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1324845550

   > the types should be distinguished in fieldinfos today as well (not sure if that is currently the case)
   
   They are, `FieldInfo#vectorEncoding`.
   
   Trying to list what the change would look like:
    - Split `KnnVectorField` into `KnnFloatVectorField` and `KnnByteVectorField`.
    - Split `VectorValues` into `FloatVectorValues` and `ByteVectorValues`. `FloatVectorValues` would no longer expose its binary encoded value.
    - Split `VectorValues LeafReader#getVectorValues` into `ByteVectorValues LeafReader#getByteVectorValues` and `FloatVectorValues LeafReader#getFloatVectorValues`. These methods would return `null` if the field was indexed with the other type, like we're doing for doc values.
    - Split `LeafReader#searchNearestVectors` into one method that takes a `byte[]` and another one that takes a `float[]`. These methods would return empty hits if called with the wrong vector type, which would be consistent with exhaustive search by iterating over all vector values for the considered type.
    - Split `KnnVectorQuery` into `KnnFloatVectorQuery` and `KnnByteVectorQuery`
   
   Some questions:
    - Returning `null` instances for the wrong type in `LeadReader#getFloatVectorValues` would be consistent with doc values, but I wonder if we should rather fail in order to catch issues early. I can't think of a single case when it would be desirable to be lenient?
    - Generics would require passing array classes to a number of these methods, which I don't like much for only two types that need to be supported?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] benwtrent commented on issue #11963: Improve vector quantization API

Posted by GitBox <gi...@apache.org>.
benwtrent commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1335239951

   @rmuir thank you! I currently have a major part of the refactor already written locally. But, it will be big and will need its own rounds of review to make sure its the way we want it.
   
   I want to try and get the PR up (once the size improvements is merged) ASAP


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on issue #11963: Improve vector quantization API

Posted by GitBox <gi...@apache.org>.
rmuir commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1334640898

   @benwtrent go do your other issue first if you prefer. sorry for the trouble.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on issue #11963: Improve vector quantization API

Posted by GitBox <gi...@apache.org>.
rmuir commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1324355332

   as i mentioned on the original issue(s) (there were several PRs closed and opened and many comments were lost...), the problem is right in the title: quantization.
   
   there shouldnt be any quantization happening. otherwise it is broken.  8-bits in, 8-bits out, 32-bits in, 32-bits out. 
   
   Additionally as a minimum there needs to be two separate field classes (e.g. ByteVectorValues/FloatVectorValues) to give type safety. Really, the types should be distinguished in fieldinfos today as well (not sure if that is currently the case).
   
    Basically just follow the examples of every other part of the index. vectors isn't special.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on issue #11963: Improve vector quantization API

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1402147042

   +1 to close


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on issue #11963: Improve vector quantization API

Posted by GitBox <gi...@apache.org>.
jpountz commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1323732143

   I had not considered generics, I guess it would be an option indeed by passing the expected vector array class to `getVectorValues` and `searchNearestNeighbors`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on issue #11963: Improve vector quantization API

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1325838116

   For context, we also have this issue suggesting refactors to the 8-bit quantization change: https://github.com/apache/lucene/issues/11758. I'm +1 on revising the API and hope it helps clean up some internal complexity too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on issue #11963: Improve vector quantization API

Posted by GitBox <gi...@apache.org>.
rmuir commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1324355812

   we can break index backwards compatibility to fix this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] benwtrent commented on issue #11963: Improve vector quantization API

Posted by GitBox <gi...@apache.org>.
benwtrent commented on issue #11963:
URL: https://github.com/apache/lucene/issues/11963#issuecomment-1334127223

   @jpountz 
   
   > which I don't like much for only two types that need to be supported?
   
   I do think we will add more types in the future. Specifically `binary` which can use an optimized hamming distance. This is useful for image search.
   
   Splitting it all up like this is a big effort but I suppose its how it should have bene from the beginning.
   
   Frustrating to block https://github.com/apache/lucene/pull/11860 for this. 
   
   I will see about iterating on this refactor and see what the fallout is.
   
   @rmuir if you have any further opinions on how you think this should look, let me know.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] javanna closed issue #11963: Improve vector quantization API

Posted by "javanna (via GitHub)" <gi...@apache.org>.
javanna closed issue #11963: Improve vector quantization API
URL: https://github.com/apache/lucene/issues/11963


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org