Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/03/04 00:23:37 UTC
[GitHub] [lucene-solr] jtibshirani opened a new pull request #1314: Coarse quantization
jtibshirani opened a new pull request #1314: Coarse quantization
URL: https://github.com/apache/lucene-solr/pull/1314
**Note:** this PR is just meant to sketch out an idea and is not meant for detailed review.
This PR shows a kNN approach based on coarse quantization (IVFFlat). It adds a new format `VectorsFormat`, which simply delegates to `DocValuesFormat` and `PostingsFormat` under the hood:
* The original vectors are stored as `BinaryDocValues`.
* The vectors are also clustered, and the cluster information is stored in postings format. In particular, each cluster centroid is encoded to a `BytesRef` to represent a term. Each document belonging to the centroid is added to the postings list for that term.
There are currently some pretty big hacks:
* We reuse the existing doc values and postings formats for simplicity. This is fairly fragile, since we write to the same files as normal doc values and postings -- I think there would be a conflict if a segment contained both a vector field and a doc values field with the same name.
* To write the postings lists, we compute the map from centroid to documents in memory. We then expose it through a hacky `Fields` implementation called `ClusterBackedFields` and pass it to the postings writer. It would be better to avoid this hack and to compute the cluster information without building an in-memory map.
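The indexing and search scheme above can be sketched in Python. This is purely illustrative: the PR implements it in Java on top of Lucene's `DocValuesFormat` and `PostingsFormat`, and all function and variable names below are invented for the sketch.

```python
import numpy as np

def build_ivf_index(vectors, num_centroids, iters=10, seed=0):
    """Toy k-means clustering. Returns the centroids plus a map from
    centroid id to the doc ids assigned to it -- the analogue of the
    term -> postings-list structure described above."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), num_centroids, replace=False)].copy()
    for _ in range(iters):
        # assign every vector to its nearest centroid, then recompute means
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        for c in range(num_centroids):
            members = vectors[assignments == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # the final assignment defines the "postings lists"
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    postings = {c: np.flatnonzero(assignments == c) for c in range(num_centroids)}
    return centroids, postings

def knn_search(query, vectors, centroids, postings, k=10, n_probes=5):
    """Visit only the n_probes nearest clusters and score their documents."""
    nearest = np.linalg.norm(centroids - query, axis=1).argsort()[:n_probes]
    candidates = np.concatenate([postings[c] for c in nearest])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[dists.argsort()[:k]]
```

With `n_probes` equal to the total number of centroids this degenerates to exact search; smaller values trade recall for speed, which is the recall/QPS trade-off visible in the benchmark tables.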
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene-solr] jtibshirani commented on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
Posted by GitBox <gi...@apache.org>.
jtibshirani commented on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326
**Benchmarks**
In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second. For the number of centroids, we use the heuristic num centroids = sqrt(dataset size).
sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.425
LuceneCluster(n_probes=5) 0.756 604.133
LuceneCluster(n_probes=10) 0.874 323.791
LuceneCluster(n_probes=20) 0.951 166.580
LuceneCluster(n_probes=50) 0.993 68.465
LuceneCluster(n_probes=100) 0.999 35.139
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.764
LuceneCluster(n_probes=5) 0.681 642.247
LuceneCluster(n_probes=10) 0.768 343.067
LuceneCluster(n_probes=20) 0.836 177.037
LuceneCluster(n_probes=50) 0.908 73.256
LuceneCluster(n_probes=100) 0.951 37.302
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). The branch and instructions for benchmarking can be found here: https://github.com/jtibshirani/ann-benchmarks/pull/2.
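For reference, the two reported quantities and the centroid heuristic can be written down explicitly. These are hypothetical helpers for illustration, not code from the PR or from ann-benchmarks.

```python
import math

def num_centroids(dataset_size):
    # the heuristic mentioned above: num centroids = sqrt(dataset size)
    return round(math.sqrt(dataset_size))

def recall_at_k(approx_ids, exact_ids, k=10):
    # fraction of the true k nearest neighbors that the approximate
    # search also returned
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k
```

For the 1M-vector SIFT dataset the heuristic gives 1000 centroids, so even `n_probes=100` scans only about 10% of the clusters.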
[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
Posted by GitBox <gi...@apache.org>.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
**Benchmarks**
sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.425
LuceneCluster(n_probes=5) 0.749 574.186
LuceneCluster(n_probes=10) 0.874 308.455
LuceneCluster(n_probes=20) 0.951 116.871
LuceneCluster(n_probes=50) 0.993 67.354
LuceneCluster(n_probes=100) 0.999 34.651
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.722
LuceneCluster(n_probes=5) 0.680 618.438
LuceneCluster(n_probes=10) 0.766 335.956
LuceneCluster(n_probes=20) 0.835 173.782
LuceneCluster(n_probes=50) 0.905 72.747
LuceneCluster(n_probes=100) 0.948 37.339
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately, py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on more robust benchmarks.
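The overhead subtraction mentioned above amounts to removing a fixed per-query latency before converting back to QPS. A sketch of the arithmetic, assuming a constant ~3ms overhead (the function is invented for illustration):

```python
def adjusted_qps(measured_qps, overhead_seconds=0.003):
    """Strip a fixed per-query overhead (e.g. the ~3ms py4j round trip)
    from a measured queries-per-second figure."""
    true_latency = 1.0 / measured_qps - overhead_seconds
    return 1.0 / true_latency
```

For example, a raw measurement of 250 QPS (4ms per query) corresponds to 1000 QPS once the 3ms overhead is removed, so the correction matters most for the fastest configurations.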
[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
Posted by GitBox <gi...@apache.org>.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
**Benchmarks**
In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second.
sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.425
LuceneCluster(n_probes=5) 0.749 574.186
LuceneCluster(n_probes=10) 0.874 308.455
LuceneCluster(n_probes=20) 0.951 116.871
LuceneCluster(n_probes=50) 0.993 67.354
LuceneCluster(n_probes=100) 0.999 34.651
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.722
LuceneCluster(n_probes=5) 0.680 618.438
LuceneCluster(n_probes=10) 0.766 335.956
LuceneCluster(n_probes=20) 0.835 173.782
LuceneCluster(n_probes=50) 0.905 72.747
LuceneCluster(n_probes=100) 0.948 37.339
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately, py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on more robust benchmarks.
[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
Posted by GitBox <gi...@apache.org>.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
**Benchmarks**
sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.425
LuceneCluster(n_probes=2) 0.536 1138.926
LuceneCluster(n_probes=5) 0.749 574.186
LuceneCluster(n_probes=10) 0.874 308.455
LuceneCluster(n_probes=20) 0.951 116.871
LuceneCluster(n_probes=50) 0.993 67.354
LuceneCluster(n_probes=100) 0.999 34.651
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.722
LuceneCluster(n_probes=5) 0.680 618.438
LuceneCluster(n_probes=10) 0.766 335.956
LuceneCluster(n_probes=20) 0.835 173.782
LuceneCluster(n_probes=50) 0.905 72.747
LuceneCluster(n_probes=100) 0.948 37.339
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately, py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on more robust benchmarks.
[GitHub] [lucene-solr] jtibshirani removed a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
Posted by GitBox <gi...@apache.org>.
jtibshirani removed a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
**Benchmarks**
In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second. For the number of centroids, we use the heuristic num centroids = sqrt(dataset size).
sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.425
LuceneCluster(n_probes=5) 0.749 574.186
LuceneCluster(n_probes=10) 0.874 308.455
LuceneCluster(n_probes=20) 0.951 116.871
LuceneCluster(n_probes=50) 0.993 67.354
LuceneCluster(n_probes=100) 0.999 34.651
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.722
LuceneCluster(n_probes=5) 0.680 618.438
LuceneCluster(n_probes=10) 0.766 335.956
LuceneCluster(n_probes=20) 0.835 173.782
LuceneCluster(n_probes=50) 0.905 72.747
LuceneCluster(n_probes=100) 0.948 37.339
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately, py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on a more robust benchmarking set-up.
[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
Posted by GitBox <gi...@apache.org>.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
**Benchmarks**
In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second.
sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.425
LuceneCluster(n_probes=5) 0.749 574.186
LuceneCluster(n_probes=10) 0.874 308.455
LuceneCluster(n_probes=20) 0.951 116.871
LuceneCluster(n_probes=50) 0.993 67.354
LuceneCluster(n_probes=100) 0.999 34.651
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.722
LuceneCluster(n_probes=5) 0.680 618.438
LuceneCluster(n_probes=10) 0.766 335.956
LuceneCluster(n_probes=20) 0.835 173.782
LuceneCluster(n_probes=50) 0.905 72.747
LuceneCluster(n_probes=100) 0.948 37.339
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately, py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on a more robust benchmarking set-up.
[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
Posted by GitBox <gi...@apache.org>.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
**Benchmarks**
In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second. For the number of centroids, we use the heuristic num centroids = sqrt(dataset size).
sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.425
LuceneCluster(n_probes=5) 0.749 574.186
LuceneCluster(n_probes=10) 0.874 308.455
LuceneCluster(n_probes=20) 0.951 116.871
LuceneCluster(n_probes=50) 0.993 67.354
LuceneCluster(n_probes=100) 0.999 34.651
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.722
LuceneCluster(n_probes=5) 0.680 618.438
LuceneCluster(n_probes=10) 0.766 335.956
LuceneCluster(n_probes=20) 0.835 173.782
LuceneCluster(n_probes=50) 0.905 72.747
LuceneCluster(n_probes=100) 0.948 37.339
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately, py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on a more robust benchmarking set-up.
[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
Posted by GitBox <gi...@apache.org>.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326
**Benchmarks**
In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second. For the number of centroids, we use the heuristic num centroids = sqrt(dataset size).
sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.425
LuceneCluster(n_probes=5) 0.756 604.133
LuceneCluster(n_probes=10) 0.874 323.791
LuceneCluster(n_probes=20) 0.951 166.580
LuceneCluster(n_probes=50) 0.993 68.465
LuceneCluster(n_probes=100) 0.999 35.139
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.764
LuceneCluster(n_probes=5) 0.681 642.247
LuceneCluster(n_probes=10) 0.768 343.067
LuceneCluster(n_probes=20) 0.836 177.037
LuceneCluster(n_probes=50) 0.908 73.256
LuceneCluster(n_probes=100) 0.951 37.302
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). The branch and instructions for benchmarking can be found here: https://github.com/jtibshirani/ann-benchmarks/pull/2.
[GitHub] [lucene-solr] jtibshirani commented on issue #1314: Coarse quantization
Posted by GitBox <gi...@apache.org>.
jtibshirani commented on issue #1314: Coarse quantization
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
**Benchmarks**
sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.425
LuceneCluster(n_probes=2) 0.536 1138.926
LuceneCluster(n_probes=5) 0.749 574.186
LuceneCluster(n_probes=10) 0.874 308.455
LuceneCluster(n_probes=20) 0.951 116.871
LuceneCluster(n_probes=50) 0.993 67.354
LuceneCluster(n_probes=100) 0.999 34.651
```
glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
```
APPROACH RECALL QPS
LuceneExact() 1.000 6.722
LuceneCluster(n_probes=5) 0.680 618.438
LuceneCluster(n_probes=10) 0.766 335.956
LuceneCluster(n_probes=20) 0.835 173.782
LuceneCluster(n_probes=50) 0.905 72.747
LuceneCluster(n_probes=100) 0.948 37.339
```
These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately, py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on more robust benchmarks.