Posted to issues@lucene.apache.org by "wangyanbn (via GitHub)" <gi...@apache.org> on 2023/05/04 04:20:38 UTC

[GitHub] [lucene] wangyanbn opened a new issue, #12263: Suggestion for VectorUtil.dotProductScore, to make the byte and float vector dotProductScore same

wangyanbn opened a new issue, #12263:
URL: https://github.com/apache/lucene/issues/12263

   ### Description
   
   When we choose byte vectors instead of float vectors, we only want to use less memory, and we expect the byte dot product scores to be the same as the float dot product scores. When using bytes, we usually multiply every component of the normalized float vector by 128 (and ensure the maximum value is <= 127). I think this is the common use case for choosing byte vectors.
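   
   As a rough illustration of that conversion, here is a minimal sketch. The class `ByteQuantizationSketch` and the method `quantize` are hypothetical names for this example, not Lucene API, and rounding to nearest is an assumption; an application may truncate instead.
   
   ```java
   // Hypothetical helper, not part of Lucene: turns a unit-length float vector into signed bytes.
   final class ByteQuantizationSketch {
     static byte[] quantize(float[] v) {
       byte[] out = new byte[v.length];
       for (int i = 0; i < v.length; i++) {
         // Scale by 128, rounding to the nearest integer (assumed rounding mode).
         int scaled = Math.round(v[i] * 128f);
         // Clamp so that +1.0 maps to 127 instead of overflowing the signed byte range.
         out[i] = (byte) Math.max(-128, Math.min(127, scaled));
       }
       return out;
     }
   
     public static void main(String[] args) {
       float[] unit = {0.6f, -0.8f, 0f}; // length 1: 0.36 + 0.64 = 1
       System.out.println(java.util.Arrays.toString(quantize(unit))); // prints [77, -102, 0]
     }
   }
   ```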
   
   In [VectorUtil.dotProductScore](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/VectorUtil.java#L268), `denom` is multiplied by the array length `a.length`, which makes the byte vector dot product score very different from the float vector score.
   ```
    // divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len
    float denom = (float) (a.length * (1 << 15));
    return 0.5f + dotProduct(a, b) / denom;
   ```
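   
   To make the effect concrete, here is a minimal standalone sketch that repeats the arithmetic quoted above without calling Lucene. The class and method names are hypothetical, and the test vector is an assumption chosen so that it is exactly a unit vector scaled by 128 (1024 components of 4, since 128 / sqrt(1024) = 4).
   
   ```java
   // Standalone sketch; reimplements the quoted formula rather than calling Lucene.
   final class DotProductScoreSketch {
     static int dotProduct(byte[] a, byte[] b) {
       int total = 0;
       for (int i = 0; i < a.length; i++) {
         total += a[i] * b[i];
       }
       return total;
     }
   
     // Same arithmetic as the quoted snippet: the denominator includes a.length.
     static float byteScore(byte[] a, byte[] b) {
       float denom = (float) (a.length * (1 << 15));
       return 0.5f + dotProduct(a, b) / denom;
     }
   
     public static void main(String[] args) {
       int dim = 1024;
       byte[] v = new byte[dim];
       byte[] opposite = new byte[dim];
       java.util.Arrays.fill(v, (byte) 4);
       java.util.Arrays.fill(opposite, (byte) -4);
       // Best possible match (identical vectors) and worst possible match (opposite vectors).
       System.out.println(byteScore(v, v));        // slightly above 0.5
       System.out.println(byteScore(v, opposite)); // slightly below 0.5
     }
   }
   ```
   
   With these inputs the identical pair scores 0.5 + 2^-11 (about 0.5005) and the opposite pair scores 0.5 - 2^-11 (about 0.4995), so the whole range of possible scores is squeezed into roughly 0.001.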
   
   When the vector dimension is large, such as 768 or 1024, the byte vector dot product scores are all close to 0.50. If we show this score in the UI (for example, in an image search app), it may confuse users.
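   
   The reason can be worked out from the formula. This is a sketch under the assumption that the byte vectors `a` and `b` are unit-length float vectors `x` and `y` scaled by 128, ignoring rounding error; the symbols `x`, `y`, and `d` (the dimension, `a.length`) are introduced here only for illustration.
   
   ```latex
   a \approx 128\,x,\quad b \approx 128\,y
   \;\Longrightarrow\;
   a \cdot b \approx 2^{14}\,(x \cdot y)
   
   \mathrm{score}_{\mathrm{byte}}
     = \frac{1}{2} + \frac{a \cdot b}{d \cdot 2^{15}}
     \approx \frac{1}{2} + \frac{x \cdot y}{2d}
   
   |x \cdot y| \le 1
   \;\Longrightarrow\;
   \frac{1}{2} - \frac{1}{2d} \;\le\; \mathrm{score}_{\mathrm{byte}} \;\le\; \frac{1}{2} + \frac{1}{2d}
   ```
   
   For d = 768 the score can move away from 0.5 by at most about 0.00065, and for d = 1024 by at most about 0.00049. Without the factor d in the denominator, the same steps give 1/2 + (x · y)/2, which spans the full range from 0 to 1.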
   
   In a hybrid retrieval use case, this byte/float score difference also affects the order of documents. When searching with both a normal query and a kNN vector search, the document score is `knn_score*knn_boost + query_score*search_boost` (this is the case in ElasticSearch). Because the byte vector scores are so close together for high-dimensional vectors, they have almost no effect on the hybrid scores. The document order may therefore differ between byte vectors and float vectors, which is not what we expect.
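   
   A small sketch of how the order can flip. Everything here is illustrative: the two documents, their cosine values, query scores, and boosts are made-up inputs, the class and method names are hypothetical, and the kNN scores use the approximations `0.5 + c / 2` (float) and `0.5 + c / (2 * dim)` (byte with the current `denom`) for unit vectors with cosine `c`.
   
   ```java
   // Illustrative only: inputs are made up, and the kNN scores are approximations for unit vectors.
   final class HybridScoreSketch {
     // Combination formula quoted in this issue.
     static float hybrid(float knnScore, float knnBoost, float queryScore, float searchBoost) {
       return knnScore * knnBoost + queryScore * searchBoost;
     }
   
     // Assumed float score for unit vectors with cosine c.
     static float floatKnnScore(float c) {
       return 0.5f + c / 2f;
     }
   
     // Approximate byte score with the current denominator (includes the dimension).
     static float byteKnnScore(float c, int dim) {
       return 0.5f + c / (2f * dim);
     }
   
     public static void main(String[] args) {
       int dim = 768;
       // Doc A: strong vector match, weaker text match. Doc B: the reverse.
       float cosA = 0.9f, queryA = 1.0f;
       float cosB = 0.2f, queryB = 1.2f;
       System.out.println("float: A=" + hybrid(floatKnnScore(cosA), 1f, queryA, 1f)
           + " B=" + hybrid(floatKnnScore(cosB), 1f, queryB, 1f));
       System.out.println("byte:  A=" + hybrid(byteKnnScore(cosA, dim), 1f, queryA, 1f)
           + " B=" + hybrid(byteKnnScore(cosB, dim), 1f, queryB, 1f));
     }
   }
   ```
   
   With float scores, doc A ranks first (about 1.95 vs 1.80). With byte scores, the kNN part is nearly constant, so doc B ranks first (about 1.50 vs 1.70).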
   
   If `denom` in [VectorUtil.dotProductScore](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/VectorUtil.java#L268) is not multiplied by the array length, the byte dot product score will be the same as the float dot product score.
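   
   A minimal sketch of the suggested variant, written as a standalone class rather than a patch to Lucene. The names `ProposedByteScoreSketch` and `proposedByteScore` are hypothetical, and the test vectors assume 1024 components of 4 or -4 (a unit vector scaled by 128).
   
   ```java
   // Standalone sketch of the suggestion in this issue; not a Lucene patch.
   final class ProposedByteScoreSketch {
     static int dotProduct(byte[] a, byte[] b) {
       int total = 0;
       for (int i = 0; i < a.length; i++) {
         total += a[i] * b[i];
       }
       return total;
     }
   
     // Same as the quoted formula, but the denominator no longer includes a.length.
     static float proposedByteScore(byte[] a, byte[] b) {
       float denom = (float) (1 << 15);
       return 0.5f + dotProduct(a, b) / denom;
     }
   
     public static void main(String[] args) {
       byte[] v = new byte[1024];
       byte[] opposite = new byte[1024];
       java.util.Arrays.fill(v, (byte) 4);
       java.util.Arrays.fill(opposite, (byte) -4);
       System.out.println(proposedByteScore(v, v));        // prints 1.0
       System.out.println(proposedByteScore(v, opposite)); // prints 0.0
     }
   }
   ```
   
   Note that this variant stays within 0 and 1 only when the byte vectors come from unit-length float vectors as described above; for arbitrary byte vectors the dot product can be much larger than 2^14.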
   So, shall we change the byte dot product score logic?
   Thanks a lot!
   

