
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/06/23 13:36:41 UTC

[GitHub] [lucene] msokolov commented on pull request #926: VectorSimilarityFunction reverse removal

msokolov commented on PR #926:
URL: https://github.com/apache/lucene/pull/926#issuecomment-1164418508

   Hi Alessandro, thank you for running the tests. I'm suspicious of the results though -- they just look too good to be true! I know from profiling that we spend most of the time in similarity computations, yet this change doesn't impact how many of those we do nor how costly they are.
   
   One thing I see is that you are using an `hdf5` file as input, but this tester was not designed to accept that format. Here is a script I have used to extract raw floating-point data (what KnnGraphTester expects) from hdf5. It also takes care of normalizing to unit vectors, which you should do for angular data, but not for euclidean.
   
   ```
   import h5py
   import numpy as np
   import sys
   
   with h5py.File(sys.argv[1], 'r') as f:
       for key in f.keys():
           print(f"{key}: {f[key].shape}")
           ds = f[key]
           print(f"copying {ds.shape} from {key}")
           arr = np.zeros(ds.shape, dtype='float32')
           ds.read_direct(arr)
   
           # normalize all vectors (along dim 1) to unit length
           norm = np.linalg.norm(arr, 2, 1)
           norm[norm==0] = 1
           arr = arr / np.expand_dims(norm, 1)
   
           arr.tofile(sys.argv[1] + "-" + key)
   ```
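
A quick way to sanity-check the extracted files is to read one back and confirm the rows have unit length. This is only a minimal sketch, not part of the script above: the helper names `load_raw_vectors` and `check_unit_norm` are hypothetical, and the vector dimension `dim` is assumed to be known by the caller, because `tofile` writes a headerless stream of float32 values and does not store the shape.

```python
import numpy as np

def load_raw_vectors(path, dim):
    """Read a headerless float32 file back into an (n, dim) array.

    `dim` is an assumption supplied by the caller; the raw file
    written by ndarray.tofile carries no shape information.
    """
    flat = np.fromfile(path, dtype="float32")
    if flat.size % dim != 0:
        raise ValueError(
            f"{path}: {flat.size} floats is not a multiple of dim={dim}"
        )
    return flat.reshape(-1, dim)

def check_unit_norm(vectors, tol=1e-4):
    """Return True if every non-zero row has an L2 norm within tol of 1."""
    norms = np.linalg.norm(vectors, axis=1)
    nonzero = norms > 0  # all-zero rows are left as zeros by the script
    return bool(np.all(np.abs(norms[nonzero] - 1.0) <= tol))
```

For example, `check_unit_norm(load_raw_vectors(path, dim))` should return `True` for a file of normalized vectors, where `path` and `dim` are whatever file and dimension you extracted. For a euclidean dataset, where the normalization step should be skipped, it would be expected to return `False`.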
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


For additional commands, e-mail: issues-help@lucene.apache.org