Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/06/23 13:36:41 UTC
[GitHub] [lucene] msokolov commented on pull request #926: VectorSimilarityFunction reverse removal
msokolov commented on PR #926:
URL: https://github.com/apache/lucene/pull/926#issuecomment-1164418508
Hi Alessandro, thank you for running the tests. I'm suspicious of the results though -- they just look too good to be true! I know from profiling that we spend most of the time in similarity computations, yet this change affects neither how many of those we do nor how costly they are.
One thing I see is that you are using an `hdf5` file as input, but this tester was not designed to accept that format. Here is a script I have used to extract the raw floating-point data that KnnGraphTester expects from an hdf5 file. It also normalizes the vectors to unit length, which you should do for angular data but not for Euclidean data.
```
import h5py
import numpy as np
import sys

# Write each dataset in the hdf5 file given as argv[1] to a raw float32 file
# named <input>-<key>.
with h5py.File(sys.argv[1], 'r') as f:
    for key in f.keys():
        print(f"{key}: {f[key].shape}")
        ds = f[key]
        print(f"copying {ds.shape} from {key}")
        arr = np.zeros(ds.shape, dtype='float32')
        ds.read_direct(arr)
        # normalize all vectors (along dim 1) to unit length
        norm = np.linalg.norm(arr, 2, 1)
        # leave all-zero vectors unchanged instead of dividing by zero
        norm[norm==0] = 1
        arr = arr / np.expand_dims(norm, 1)
        arr.tofile(sys.argv[1] + "-" + key)
```
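Before rerunning the benchmark it may be worth confirming that an extracted file really holds unit vectors. The following is a minimal sketch of such a check, not part of KnnGraphTester: `norm_range` is a hypothetical helper, and it assumes the file is headerless float32 data in native byte order, one vector after another, with a dimension you already know.

```python
import math
import os
from array import array


def norm_range(path, dim):
    """Return (count, min_norm, max_norm) for a headerless float32 vector file.

    Assumes native byte order and `dim` floats per vector (both assumptions,
    matching what numpy's tofile writes for a 2-D float32 array).
    """
    size = os.path.getsize(path)
    row_bytes = 4 * dim
    if size % row_bytes != 0:
        raise ValueError(f"{path}: {size} bytes is not a multiple of {row_bytes}")
    count = size // row_bytes
    lo, hi = math.inf, 0.0
    with open(path, "rb") as f:
        for _ in range(count):
            row = array("f")          # 4-byte C floats
            row.fromfile(f, dim)      # read exactly one vector
            n = math.sqrt(sum(x * x for x in row))
            lo, hi = min(lo, n), max(hi, n)
    return count, lo, hi
```

For a hypothetical file `data.hdf5-train` of 100-dimensional vectors, `norm_range("data.hdf5-train", 100)` should report minimum and maximum norms both very close to 1.0 if the normalization was applied (all-zero vectors, if any, would show up as a minimum of 0.0).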