Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/03/04 00:23:37 UTC

[GitHub] [lucene-solr] jtibshirani opened a new pull request #1314: Coarse quantization

jtibshirani opened a new pull request #1314: Coarse quantization
URL: https://github.com/apache/lucene-solr/pull/1314
 
 
   **Note:** this PR is just meant to sketch out an idea and is not intended for detailed review.
   
   This PR shows a kNN approach based on coarse quantization (IVFFlat). It adds a new format `VectorsFormat`, which simply delegates to `DocValuesFormat` and `PostingsFormat` under the hood:
   * The original vectors are stored as `BinaryDocValues`.
   * The vectors are also clustered, and the cluster information is stored in postings format. In particular, each cluster centroid is encoded to a `BytesRef` to represent a term. Each document belonging to the centroid is added to the postings list for that term.
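
   As a rough illustration of that encoding, the sketch below packs a centroid into a term. This is a minimal sketch under assumptions, not the PR's actual code: the class name and the `ByteBuffer` layout are made up for the example.
   ```java
   import java.nio.ByteBuffer;

   import org.apache.lucene.util.BytesRef;

   // Hypothetical helper: pack a centroid's float components into a BytesRef so
   // the centroid can act as a "term" whose postings list holds the documents
   // assigned to that cluster. Illustrative only; not the encoding in this PR.
   final class CentroidEncoder {
     static BytesRef encode(float[] centroid) {
       ByteBuffer buffer = ByteBuffer.allocate(centroid.length * Float.BYTES);
       for (float component : centroid) {
         buffer.putFloat(component);
       }
       return new BytesRef(buffer.array());
     }
   }
   ```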
   
   There are currently some pretty big hacks:
   * We re-use the existing doc values and postings formats for simplicity. This is fairly fragile since we write to the same files as normal doc values and postings -- I think there would be a conflict if there were both a vector field and a doc values field with the same name.
   * To write the postings list, we compute the map from centroid to documents in memory. We then expose it through a hacky `Fields` implementation called `ClusterBackedFields` and pass it to the postings writer. It would be better to avoid this hack and to compute the cluster information without building the full map in memory.
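
   For reference, the in-memory map mentioned above could be as simple as the following sketch. The class and method names are hypothetical; this only shows the shape of the data, not the PR's implementation.
   ```java
   import java.util.ArrayList;
   import java.util.Collections;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   import org.apache.lucene.util.BytesRef;

   // Hypothetical sketch of the centroid -> documents mapping that would later
   // be exposed through a Fields implementation and handed to the postings
   // writer.
   final class ClusterMap {
     private final Map<BytesRef, List<Integer>> docsByCentroid = new HashMap<>();

     void add(BytesRef encodedCentroid, int docId) {
       docsByCentroid
           .computeIfAbsent(encodedCentroid, key -> new ArrayList<>())
           .add(docId);
     }

     List<Integer> docsFor(BytesRef encodedCentroid) {
       return docsByCentroid.getOrDefault(encodedCentroid, Collections.emptyList());
     }
   }
   ```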

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jtibshirani commented on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326
 
 
   **Benchmarks**
   In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second. For the number of centroids, we use the heuristic num centroids = sqrt(dataset size).
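
   To make the parameters concrete: for the 1M-vector SIFT dataset the heuristic gives roughly sqrt(1,000,000) = 1000 centroids, and `n_probes` controls how many of those clusters are scanned per query. The sketch below shows the general query-time idea under those assumptions; the names are illustrative (not the PR's API) and Euclidean distance is assumed.
   ```java
   import java.util.Arrays;
   import java.util.Comparator;

   // Simplified sketch of IVFFlat probing at query time (illustrative only, not
   // the PR's API): the query is scored against every centroid, the n_probes
   // closest clusters are kept, and only documents in those clusters are then
   // compared against the query to find the top k.
   final class ProbeSelector {
     static int[] nearestCentroids(float[] query, float[][] centroids, int nProbes) {
       Integer[] order = new Integer[centroids.length];
       for (int i = 0; i < centroids.length; i++) {
         order[i] = i;
       }
       Arrays.sort(order, Comparator.comparingDouble(
           (Integer i) -> squaredDistance(query, centroids[i])));
       int[] probes = new int[Math.min(nProbes, centroids.length)];
       for (int i = 0; i < probes.length; i++) {
         probes[i] = order[i];
       }
       return probes;
     }

     private static double squaredDistance(float[] a, float[] b) {
       double sum = 0;
       for (int i = 0; i < a.length; i++) {
         double diff = a[i] - b[i];
         sum += diff * diff;
       }
       return sum;
     }
   }
   ```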
   
   sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.425
   LuceneCluster(n_probes=5)         0.756      604.133
   LuceneCluster(n_probes=10)        0.874      323.791
   LuceneCluster(n_probes=20)        0.951      166.580
   LuceneCluster(n_probes=50)        0.993       68.465
   LuceneCluster(n_probes=100)       0.999       35.139
   ```
   
   glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.764
   LuceneCluster(n_probes=5)         0.681      642.247
   LuceneCluster(n_probes=10)        0.768      343.067
   LuceneCluster(n_probes=20)        0.836      177.037
   LuceneCluster(n_probes=50)        0.908       73.256
   LuceneCluster(n_probes=100)       0.951       37.302
   ```
   
   These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). The branch and instructions for benchmarking can be found here: https://github.com/jtibshirani/ann-benchmarks/pull/2.


[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.

Posted by GitBox <gi...@apache.org>.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
 
 
   **Benchmarks**
   
   sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.425
   LuceneCluster(n_probes=5)         0.749      574.186
   LuceneCluster(n_probes=10)        0.874      308.455
   LuceneCluster(n_probes=20)        0.951      116.871
   LuceneCluster(n_probes=50)        0.993       67.354
   LuceneCluster(n_probes=100)       0.999       34.651
   ```
   
   glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.722
   LuceneCluster(n_probes=5)         0.680      618.438
   LuceneCluster(n_probes=10)        0.766      335.956
   LuceneCluster(n_probes=20)        0.835      173.782
   LuceneCluster(n_probes=50)        0.905       72.747
   LuceneCluster(n_probes=100)       0.948       37.339
   ```
   
   These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on more robust benchmarks.
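   (For example, with made-up numbers: if a query measured 5ms end-to-end through py4j, subtracting the ~3ms bridge overhead leaves 2ms of actual search time, so the reported figure would be roughly 500 QPS rather than the raw 200 QPS.)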
   


[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.

Posted by GitBox <gi...@apache.org>.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
 
 
   **Benchmarks**
   In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second.
   
   sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.425
   LuceneCluster(n_probes=5)         0.749      574.186
   LuceneCluster(n_probes=10)        0.874      308.455
   LuceneCluster(n_probes=20)        0.951      116.871
   LuceneCluster(n_probes=50)        0.993       67.354
   LuceneCluster(n_probes=100)       0.999       34.651
   ```
   
   glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.722
   LuceneCluster(n_probes=5)         0.680      618.438
   LuceneCluster(n_probes=10)        0.766      335.956
   LuceneCluster(n_probes=20)        0.835      173.782
   LuceneCluster(n_probes=50)        0.905       72.747
   LuceneCluster(n_probes=100)       0.948       37.339
   ```
   
   These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on more robust benchmarks.
   


[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.

Posted by GitBox <gi...@apache.org>.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
 
 
   **Benchmarks**
   
   sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.425
   LuceneCluster(n_probes=2)         0.536     1138.926
   LuceneCluster(n_probes=5)         0.749      574.186
   LuceneCluster(n_probes=10)        0.874      308.455
   LuceneCluster(n_probes=20)        0.951      116.871
   LuceneCluster(n_probes=50)        0.993       67.354
   LuceneCluster(n_probes=100)       0.999       34.651
   ```
   
   glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.722
   LuceneCluster(n_probes=5)         0.680      618.438
   LuceneCluster(n_probes=10)        0.766      335.956
   LuceneCluster(n_probes=20)        0.835      173.782
   LuceneCluster(n_probes=50)        0.905       72.747
   LuceneCluster(n_probes=100)       0.948       37.339
   ```
   
   These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on more robust benchmarks.
   


[GitHub] [lucene-solr] jtibshirani removed a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.

Posted by GitBox <gi...@apache.org>.
jtibshirani removed a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
 
 
   **Benchmarks**
   In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second. For the number of centroids, we use the heuristic num centroids = sqrt(dataset size).
   
   sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.425
   LuceneCluster(n_probes=5)         0.749      574.186
   LuceneCluster(n_probes=10)        0.874      308.455
   LuceneCluster(n_probes=20)        0.951      116.871
   LuceneCluster(n_probes=50)        0.993       67.354
   LuceneCluster(n_probes=100)       0.999       34.651
   ```
   
   glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.722
   LuceneCluster(n_probes=5)         0.680      618.438
   LuceneCluster(n_probes=10)        0.766      335.956
   LuceneCluster(n_probes=20)        0.835      173.782
   LuceneCluster(n_probes=50)        0.905       72.747
   LuceneCluster(n_probes=100)       0.948       37.339
   ```
   
   These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on a more robust benchmarking set-up.
   


[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.

Posted by GitBox <gi...@apache.org>.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
 
 
   **Benchmarks**
   In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second.
   
   sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.425
   LuceneCluster(n_probes=5)         0.749      574.186
   LuceneCluster(n_probes=10)        0.874      308.455
   LuceneCluster(n_probes=20)        0.951      116.871
   LuceneCluster(n_probes=50)        0.993       67.354
   LuceneCluster(n_probes=100)       0.999       34.651
   ```
   
   glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.722
   LuceneCluster(n_probes=5)         0.680      618.438
   LuceneCluster(n_probes=10)        0.766      335.956
   LuceneCluster(n_probes=20)        0.835      173.782
   LuceneCluster(n_probes=50)        0.905       72.747
   LuceneCluster(n_probes=100)       0.948       37.339
   ```
   
   These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on a more robust benchmarking set-up.
   


[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.

Posted by GitBox <gi...@apache.org>.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
 
 
   **Benchmarks**
   In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second. For the number of centroids, we use the heuristic num centroids = sqrt(dataset size).
   
   sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.425
   LuceneCluster(n_probes=5)         0.749      574.186
   LuceneCluster(n_probes=10)        0.874      308.455
   LuceneCluster(n_probes=20)        0.951      116.871
   LuceneCluster(n_probes=50)        0.993       67.354
   LuceneCluster(n_probes=100)       0.999       34.651
   ```
   
   glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.722
   LuceneCluster(n_probes=5)         0.680      618.438
   LuceneCluster(n_probes=10)        0.766      335.956
   LuceneCluster(n_probes=20)        0.835      173.782
   LuceneCluster(n_probes=50)        0.905       72.747
   LuceneCluster(n_probes=100)       0.948       37.339
   ```
   
   These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on a more robust benchmarking set-up.
   


[GitHub] [lucene-solr] jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.

Posted by GitBox <gi...@apache.org>.
jtibshirani edited a comment on issue #1314: LUCENE-9136: Coarse quantization that reuses existing formats.
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326
 
 
   **Benchmarks**
   In these benchmarks, we find the nearest k=10 vectors and record the recall and queries per second. For the number of centroids, we use the heuristic num centroids = sqrt(dataset size).
   
   sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.425
   LuceneCluster(n_probes=5)         0.756      604.133
   LuceneCluster(n_probes=10)        0.874      323.791
   LuceneCluster(n_probes=20)        0.951      166.580
   LuceneCluster(n_probes=50)        0.993       68.465
   LuceneCluster(n_probes=100)       0.999       35.139
   ```
   
   glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.764
   LuceneCluster(n_probes=5)         0.681      642.247
   LuceneCluster(n_probes=10)        0.768      343.067
   LuceneCluster(n_probes=20)        0.836      177.037
   LuceneCluster(n_probes=50)        0.908       73.256
   LuceneCluster(n_probes=100)       0.951       37.302
   ```
   
   These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). The branch and instructions for benchmarking can be found here: https://github.com/jtibshirani/ann-benchmarks/pull/2.


[GitHub] [lucene-solr] jtibshirani commented on issue #1314: Coarse quantization

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on issue #1314: Coarse quantization
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
 
 
   **Benchmarks**
   
   sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.425
   LuceneCluster(n_probes=2)         0.536     1138.926
   LuceneCluster(n_probes=5)         0.749      574.186
   LuceneCluster(n_probes=10)        0.874      308.455
   LuceneCluster(n_probes=20)        0.951      116.871
   LuceneCluster(n_probes=50)        0.993       67.354
   LuceneCluster(n_probes=100)       0.999       34.651
   ```
   
   glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
   ```
   APPROACH                          RECALL     QPS
   LuceneExact()                     1.000        6.722
   LuceneCluster(n_probes=5)         0.680      618.438
   LuceneCluster(n_probes=10)        0.766      335.956
   LuceneCluster(n_probes=20)        0.835      173.782
   LuceneCluster(n_probes=50)        0.905       72.747
   LuceneCluster(n_probes=100)       0.948       37.339
   ```
   
   These benchmarks were performed using the [ann-benchmarks repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to the benchmarking framework using py4j (e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit of overhead (~3ms per search), so I had to measure that overhead and subtract it from the results. This is really not ideal; I will work on more robust benchmarks.
   
