Posted to dev@lucene.apache.org by Pedram Rezaei <pe...@microsoft.com.INVALID> on 2019/02/27 00:44:15 UTC

Vector based store and ANN

Hi there,

Is there a way to store numerical vectors (a vector-based index) and perform searches based on the Approximate Nearest Neighbor<https://github.com/erikbern/ann-benchmarks> (ANN) class of algorithms in Lucene?

If not, has there been any interest in the topic so far?

Thanks,

Pedram

Re: Vector based store and ANN

Posted by Doug Turnbull <dt...@opensourceconnections.com>.
Hi Pedram (and community)

The invite link for Relevance slack is http://o19s.com/slack ...

Best!
-Doug

On Fri, Mar 1, 2019 at 7:23 PM Pedram Rezaei <pe...@microsoft.com.invalid>
wrote:

> Hi there,
>
>
>
> Thank you for sharing your thoughts. I am finding them extremely useful
> and to be honest exciting!
>
>
>
> Regarding the vector-based scoring, you are 100% correct. There are many
> ways of having an efficient vector-based similarity scorer implemented on
> top of an encoded vector stored at the document level in Lucene.
>
>
>
> As you have rightly pointed out, this in itself might not be sufficient
> for large indexes. After all, the engine would need to read the vector per
> document and then calculate similarity.
>
>
>
> LSH or similar n-pass (n>1) techniques are pretty interesting and
> certainly can get us closer to using the existing index for lookup. As you
> rightly point out below, it can come at a cost either to the performance or
> the precision.
>
>
>
> I am personally very intrigued by the new generation of vector-based
> indexes such as Facebook’s FAISS
> <https://github.com/facebookresearch/faiss> library for similarity search
> and clustering of dense vectors used as part of larger search offerings. Do
> you think there might be a world in which Lucene would want to have
> first-class support for vector-based searches? I think with such a
> capability, we might open the door for new and innovative ways of
> information retrieval.
>
>
>
> I am grateful to you all for your insights and this fascinating discussion!
>
>
>
> Pedram
>
>
>
> P.S. How do I join https://relevancy.slack.com?
>
>
>
> *From:* René Kriegler <rk...@rene-kriegler.de>
> *Sent:* Friday, March 1, 2019 3:24 PM
> *To:* Pedram Rezaei <pe...@microsoft.com>
> *Cc:* dev@lucene.apache.org; Radhakrishnan Srikanth (SRIKANTH) <
> rsrikan@microsoft.com>; Arun Sacheti <ar...@bing.com>; Kun Wu <
> Wu.Kun@microsoft.com>; Junhua Wang <Ju...@microsoft.com>; Jason Li <
> jasol@microsoft.com>
> *Subject:* Re: Vector based store and ANN
>
>
>
> Hi there,
>
>
>
> Thank you for looping me in. Just a few random thoughts on this topic:
>
>
>
> - I’ve heard ;-) that this ES plugin is fast for vector-based scoring:
> https://github.com/StaySense/fast-cosine-similarity.
> The links in the ‘General’ section provide some history. As far as I can
> see, there is nothing which couldn’t be implemented at Lucene level.
>
>
>
> - For retrieval, I think a two-pass approach looks like something worth
> trying out. First pass: look up documents in a low dimensional space (maybe
> produced via LSH) and then, in the second pass, calculate vector distances
> in the high-dimensional space just for the documents that resulted from the
> first pass. This solution will come with some compromises to make. For
> example, a higher dimensionality of LSH would increase precision but also
> produce more hash tokens and make the lookup slower, especially for large
> indexes.
>
>
>
> - Day 2 of Haystack 2019 (https://haystackconf.com/agenda/)
> will have a talk by Simon Hughes about ‘Search with Vectors’. There is a
> channel on this topic at OpenSource Connections’ search relevance Slack
> (https://relevancy.slack.com)
> and Simon has been one of the drivers of the discussion.
>
>
>
> Best,
>
> René
>
>
>
>
>
> On 1 Mar 2019, at 20:51, Pedram Rezaei <pe...@microsoft.com> wrote:
>
>
>
> Thank you for sharing, and it is exciting to see how advanced your
> thinking is.
>
>
>
> Yes, it is the same idea, with an extra step that Rene also seems to
> allude to here
> <https://www.slideshare.net/RenKriegler/a-picture-is-worth-a-thousand-words-93680178>
> in his comment. Instead of using these types of techniques only at
> scoring time, we can use them for information retrieval from the index.
> This will allow us to, for example, index millions of images and quickly
> and efficiently look up the most relevant images.
>
>
>
> I would love to hear your and others' thoughts on this. I think there is a
> great opportunity here, but it would need a lot of input and guidance from
> the experts here.
>
>
>
> Thank you,
>
>
>
> Pedram
>
>
>
> *From:* David Smiley <da...@gmail.com>
> *Sent:* Friday, March 1, 2019 12:11 PM
> *To:* dev@lucene.apache.org
> *Cc:* Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>; Arun
> Sacheti <ar...@bing.com>; Kun Wu <Wu...@microsoft.com>; Junhua Wang <
> Junhua.Wang@microsoft.com>; Jason Li <ja...@microsoft.com>; René Kriegler
> <po...@rene-kriegler.com>
> *Subject:* Re: Vector based store and ANN
>
>
>
> This presentation by Rene Kriegler at Haystack 2018 was a real eye-opener
> to me on this subject: https://haystackconf.com/2018/relevance-scoring/.
> It uses random-projection forests, which is a very clever technique.
> (CC'ing Rene)
>
>
>
> ~ David
>
> On Fri, Mar 1, 2019 at 1:30 PM Pedram Rezaei <
> pedramr@microsoft.com.invalid> wrote:
>
> Hi there,
>
>
>
> Thank you for the responses. Yes, we have a few scenarios in mind that can
> benefit from a vector-based index optimized for ANN searches:
>
>
>
>    - Advanced, optimized, and high precision visual search: For this to
>    work, we would convert the images to their vector representations and then
>    use algorithms and implementations such as SPTAG
>    <https://github.com/Microsoft/SPTAG>, FAISS
>    <https://github.com/facebookresearch/faiss>, and HNSWLIB
>    <https://github.com/nmslib/hnswlib>.
>    - Advanced document retrieval: Using a numerical vector representation
>    of a document, we could improve the search results.
>    - Nearest neighbor queries: discovering the nearest neighbors to a
>    given query could also benefit from these ANN algorithms (although this
>    doesn’t necessarily need the vector-based index).
>
>
>
> I would be grateful to hear your thoughts and whether the community is
> open to a conversation on this topic with my team.
>
>
>
> Thanks,
>
>
>
> Pedram
>
>
>
> *From:* J. Delgado <jo...@gmail.com>
> *Sent:* Thursday, February 28, 2019 7:38 AM
> *To:* dev@lucene.apache.org
> *Cc:* Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>
> *Subject:* Re: Vector based store and ANN
>
>
>
> Lucene’s scoring function (which I believe is Okapi BM25,
> https://en.m.wikipedia.org/wiki/Okapi_BM25)
> is a kind of nearest neighbor using the TF-IDF vector representation of
> documents and query. Are you interested in applying ANN to a different
> kind of vector representation, say for example Doc2Vec?
>
>
>
> On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand <jp...@gmail.com> wrote:
>
> Hi Pedram,
>
> We don't have much in this area, but I'm hearing increasing interest
> so it'd be nice to get better there! The closest that we have is this
> class that can search for nearest neighbors for a vector of up to 8
> dimensions:
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java
> .
>
> On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
> <pe...@microsoft.com.invalid> wrote:
> >
> > Hi there,
> >
> >
> >
> > Is there a way to store numerical vectors (a vector-based index) and
> > perform searches based on the Approximate Nearest Neighbor (ANN) class of
> > algorithms in Lucene?
> >
> >
> >
> > If not, has there been any interest in the topic so far?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Pedram
>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
> --
>
> Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
>
> LinkedIn: http://linkedin.com/in/davidwsmiley
>  | Book: http://www.solrenterprisesearchserver.com
>
>
>


-- 
*Doug Turnbull **| CTO* | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.

RE: Vector based store and ANN

Posted by Pedram Rezaei <pe...@microsoft.com.INVALID>.
Merging the threads and pasting all the replies into here and responding to them below:

Thank you all for your detailed and thoughtful contributions.

Here at Bing, we used to use a coarse approximate nearest neighbor approach (something similar to the LSH hashing technique) on the inverted index, followed by a finer-grained final rescoring method. However, for Bing, we have seen a visible impact on relevance using ANN. This applies even to smaller indexes with 20M records. We also find that recall with LSH varies on the datasets we are interested in. Hence we adopted KD-tree & RNG, which has more stable recall. The algorithm is open sourced here<https://github.com/Microsoft/SPTAG>. We have also seen success with HNSW and FAISS.
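
To make the coarse-then-fine approach described above concrete, here is a minimal sketch in Python. It assumes random-hyperplane LSH for the coarse pass and exact cosine similarity for the rescoring pass. The class name `LshIndex` and its parameters (`num_tables`, `bits_per_table`) are hypothetical choices for illustration only; this is not Bing's implementation, nor anything from SPTAG or Lucene.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Exact cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LshIndex:
    """Toy random-hyperplane LSH: each table maps a bit signature to doc ids."""

    def __init__(self, dim, num_tables=8, bits_per_table=6, seed=0):
        rng = random.Random(seed)
        # One independent set of random hyperplanes per table (illustrative sizes).
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(set) for _ in range(num_tables)]
        self.vectors = {}

    def _signature(self, table, vec):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0.0
            for plane in self.planes[table]
        )

    def add(self, doc_id, vec):
        self.vectors[doc_id] = vec
        for t, table in enumerate(self.tables):
            table[self._signature(t, vec)].add(doc_id)

    def search(self, query, k):
        # Pass 1: coarse candidate lookup, analogous to matching hash tokens
        # in an inverted index.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates |= table.get(self._signature(t, query), set())
        # Pass 2: exact rescoring, restricted to the candidates from pass 1.
        scored = sorted(
            ((cosine(query, self.vectors[d]), d) for d in candidates),
            reverse=True,
        )
        return [doc_id for _, doc_id in scored[:k]]
```

More bits per table make each bucket smaller (fewer candidates to rescore, but more true neighbors missed), while more tables recover recall at the cost of extra lookups. That is the same performance/precision trade-off discussed elsewhere in this thread.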

The links provided by Doug and J. are attempting to add vectors to the existing index. These solutions are typically inefficient on medium to large indexes if used for online querying, as they tend to behave more like a linear search. The author of EsAknn has also alluded to this on its GitHub page<https://github.com/alexklibisz/elastik-nearest-neighbors/>:

“If you need to quickly run KNN on an extremely large corpus in an offline job, use one of the libraries from Ann-Benchmarks<https://github.com/erikbern/ann-benchmarks>. If you need KNN in an online setting with support for horizontally-scalable searching and indexing new vectors in near-real-time, consider EsAknn (especially if you already use Elasticsearch).”

Using a vector-based index tuned for ANN searches, with an ability to hook in other index formats and algorithms as Rene requested below, we can provide a solution that can, for example, index and serve hundreds of millions of images and offer fast queries over those indexes. We use the algorithms and indexes referenced above for image and text search. The user can choose the most relevant one or even combine several of them before the final scoring.

I would love to hear your thoughts on this and see if the community is open to a proposal by Bing on contributing some of its tech to Lucene. We will run the design and the development incrementally with the full input from the community.

Thanks,

Pedram

From: Doug Turnbull <dt...@opensourceconnections.com>
Sent: Saturday, March 2, 2019 3:50 PM
To: dev@lucene.apache.org
Cc: Pedram Rezaei <pe...@microsoft.com>; Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>; Arun Sacheti <ar...@bing.com>; Kun Wu <Wu...@microsoft.com>; Junhua Wang <Ju...@microsoft.com>; Jason Li <ja...@microsoft.com>
Subject: Re: Vector based store and ANN

I'll add that Elasticsearch has vector scoring (though not filtering/matching) coming into Elasticsearch mainline, by Mayya Sharipova:

https://github.com/elastic/elasticsearch/pull/33022

It uses doc values to do some reranking using standard similarities. It's a start, hopefully something that can be built upon.

Hoping Mayya can be at Haystack... vector filtering/similarities/use cases could even be its own breakout/collaboration session

From: René Kriegler <rk...@rene-kriegler.de>
Sent: Saturday, March 2, 2019 3:23 PM
To: J. Delgado <jo...@gmail.com>
Cc: dev@lucene.apache.org; Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>; Arun Sacheti <ar...@bing.com>; Kun Wu <Wu...@microsoft.com>; Junhua Wang <Ju...@microsoft.com>; Jason Li <ja...@microsoft.com>
Subject: Re: Vector based store and ANN

Thanks for the links, Joaquin!

Yet another thought related to an implementation at Lucene level: I wonder how much sense it makes to try to implement a one-approach-fits-all solution for vector-based retrieval. We have different expectations of a solution, depending on aspects such as vector dimensionality, domain (text vs. image recognition vs. …) and retrieval quality priorities (recall vs. precision). I think that was also reflected in the Slack discussion. I think it would be very helpful to have real-life vector datasets (labelled for specific retrieval tasks), so that we could benchmark solutions for retrieval speed and quality metrics. For example, we could easily create synthetic vector datasets for KNN search (which is still a good starting point!) - but using random vectors probably doesn’t reflect the distribution we would normally face in an image search or when searching by word embeddings.

Best,
René

On 2 Mar 2019, at 22:06, J. Delgado <jo...@gmail.com> wrote:

Apparently, there is already an implementation along the lines discussed here:

https://blog.insightdatascience.com/elastik-nearest-neighbors-4b1f6821bd62
https://github.com/alexklibisz/elastik-nearest-neighbors/

From: J. Delgado <jo...@gmail.com>
Sent: Friday, March 1, 2019 3:26 PM
To: dev@lucene.apache.org
Cc: Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>; Arun Sacheti <ar...@bing.com>; Kun Wu <Wu...@microsoft.com>; Junhua Wang <Ju...@microsoft.com>; Jason Li <ja...@microsoft.com>; René Kriegler <po...@rene-kriegler.com>
Subject: Re: Vector based store and ANN

Traditional search engines work both as a retrieval engine, with support for arbitrarily complex BOOLEAN queries, and as a scoring engine that performs vector-based similarity computations. This works very well for words (terms) because of the clever inverted index and posting list data structures, which represent a very sparse matrix that associates terms/weights with documents.  I'm not so sure whether these core properties of a search engine can be generalized to performing the selection with an ANN algorithm such as LSH and then applying a more sophisticated scoring function. Notice that doing nearest neighbor is inherently doing a top-k selection.  As stated in Rene's presentation, it can work with image recognition vectors (embeddings) by implementing a Random Projection Forest, indexing random projections and defining hyperplanes instead of the full high-dimensional vector, which is an interesting approach. It reminds me of the use of Geohash and isochrones in DoorDash's search (see https://medium.com/@DoorDash/how-we-designed-road-distances-in-doordash-search-913ef8434099).
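
The sparse-matrix view of the inverted index described above can be sketched as follows. This is a minimal illustration in Python, not Lucene's actual data structures: the function names `build_inverted_index` and `top_k`, and the toy term weights, are made up for the example. Scoring is a sparse dot product that touches only the posting lists of the query terms, followed by a top-k selection.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: {term: weight}} -> {term: [(doc_id, weight), ...]}."""
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            index[term].append((doc_id, weight))
    return index


def top_k(index, query, k):
    """Score documents by sparse dot product, reading only the posting
    lists of the query terms, then keep the k best."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by doc id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Toy corpus with hypothetical term weights.
docs = {
    "d1": {"vector": 0.8, "search": 0.6},
    "d2": {"lucene": 0.9, "search": 0.4},
    "d3": {"image": 1.0},
}
index = build_inverted_index(docs)
print(top_k(index, {"search": 1.0, "lucene": 0.5}, k=2))
```

Document `d3` is never visited because it shares no term with the query. With dense embeddings every dimension is non-zero for every document, so each "posting list" would contain the whole corpus, which is one way to see why this structure does not carry over directly.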

I've been working on ML scoring within search (traditional ML/Learning to Rank and recently Deep Learning), which has worked well in my previous lives and now at Groupon. See various presentations I have given on the topic since 2015:

https://www.youtube.com/watch?v=x-tLA8eZs1k
https://www.slideshare.net/SDianaHu/recsys-2015-tutorial-scalable-recommender-systems-where-machine-learning-meets-search
https://www.slideshare.net/bojanbabic/deep-learning-application-within-search-and-ranking-at-groupon



Thanks!

-- J

From: Pedram Rezaei <pe...@microsoft.com.INVALID>
Sent: Friday, March 1, 2019 4:23 PM
To: René Kriegler <rk...@rene-kriegler.de>; dev@lucene.apache.org
Cc: Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>; Arun Sacheti <ar...@bing.com>; Kun Wu <Wu...@microsoft.com>; Junhua Wang <Ju...@microsoft.com>; Jason Li <ja...@microsoft.com>
Subject: RE: Vector based store and ANN

Hi there,

Thank you for sharing your thoughts. I am finding them extremely useful and to be honest exciting!

Regarding the vector-based scoring, you are 100% correct. There are many ways of having an efficient vector-based similarity scorer implemented on top of an encoded vector stored at the document level in Lucene.

As you have rightly pointed out, this in itself might not be sufficient for large indexes. After all, the engine would need to read the vector per document and then calculate similarity.

LSH or similar n-pass (n>1) techniques are pretty interesting and certainly can get us closer to using the existing index for lookup. As you rightly point out below, it can come at a cost either to the performance or the precision.

I am personally very intrigued by the new generation of vector-based indexes such as Facebook’s FAISS<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffacebookresearch%2Ffaiss&data=02%7C01%7Cpedramr%40microsoft.com%7Cf29cd53595ce40d159cb08d69ea53d85%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870829889953357&sdata=zk8eLpvAVh4my%2FlBFAiug0CJhEka3YySgAMkhQitem8%3D&reserved=0> library for similarity search and clustering of dense vectors used as part of larger search offerings. Do you think there might be a world in which Lucene would want to have first-class support for vector-based searches? I think with such a capability, we might open the door for new and innovative ways of information retrieval.

I am grateful to you all for your insights and this fascinating discussion!

Pedram

P.S. How do I join https://relevancy.slack.com<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Frelevancy.slack.com&data=02%7C01%7Cpedramr%40microsoft.com%7Cf29cd53595ce40d159cb08d69ea53d85%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870829889953357&sdata=VOMtBzm9uZaWtA%2FBkMEZ3ZwFVddmSVgSg%2BFLzuLOnYo%3D&reserved=0>?

From: René Kriegler <rk...@rene-kriegler.de>>
Sent: Friday, March 1, 2019 3:24 PM
To: Pedram Rezaei <pe...@microsoft.com>>
Cc: dev@lucene.apache.org<ma...@lucene.apache.org>; Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>>; Arun Sacheti <ar...@bing.com>>; Kun Wu <Wu...@microsoft.com>>; Junhua Wang <Ju...@microsoft.com>>; Jason Li <ja...@microsoft.com>>
Subject: Re: Vector based store and ANN

Hi there,

Thank you for looping me in. Just a few random thoughts on this topic:

- I’ve heard ;-) that this ES plugin is fast for vector-based scoring: https://github.com/StaySense/fast-cosine-similarity<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FStaySense%2Ffast-cosine-similarity&data=02%7C01%7Cpedramr%40microsoft.com%7Cf29cd53595ce40d159cb08d69ea53d85%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870829889963362&sdata=OjKqNF29TXJnPN4leyc%2FTnLJtOCh0V8f5SBP%2FTYx1TQ%3D&reserved=0>. The links in the ‘General’ section provide some history. As far as I can see, there is nothing which couldn’t be implemented at Lucene level.

- For retrieval, I think a two-pass approach looks like something worth trying out. First pass: look up documents in a low dimensional space (maybe produced via LSH) and then, in the second pass, calculate vector distances in the high-dimensional space just for the documents that resulted from the first pass. This solution will come with some compromises to make. For example, a higher dimensionality of LSH would increase precision but also produce more hash tokens and make the lookup slower, especially for large indexes.

- Day 2 of Haystack 2019 (https://haystackconf.com/agenda/<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhaystackconf.com%2Fagenda%2F&data=02%7C01%7Cpedramr%40microsoft.com%7Cf29cd53595ce40d159cb08d69ea53d85%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870829889973371&sdata=NUzhe7jgW6p%2FX5U%2BscbhHBZj3d0H%2B0OWWay9JwTXbiE%3D&reserved=0>) will have a talk by Simon Hughes about ’Search with Vectors’. There is a channel on this topic at OpenSource Connections’ search relevance Slack (https://relevancy.slack.com<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Frelevancy.slack.com&data=02%7C01%7Cpedramr%40microsoft.com%7Cf29cd53595ce40d159cb08d69ea53d85%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870829889973371&sdata=Et46izIaX7OSlAS%2FXGfBf5NzRmcR5JaMcRjZHscFIfE%3D&reserved=0>) and Simon has been one of the drivers of the discussion.

Best,
René


On 1 Mar 2019, at 20:51, Pedram Rezaei <pe...@microsoft.com>> wrote:

Thank you for sharing, and it is exciting to see how advanced your thinking is.

Yes, the idea is the same idea with an extra step that Rene also seems to elude to here<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.slideshare.net%2FRenKriegler%2Fa-picture-is-worth-a-thousand-words-93680178&data=02%7C01%7Cpedramr%40microsoft.com%7Cf29cd53595ce40d159cb08d69ea53d85%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870829889983376&sdata=eDt8QhZv7QiunpR1DlYqEtU%2BDn5%2BxytFDDwDcNn4pgE%3D&reserved=0> in his comment. Instead of using these types of techniques only at the scoring time, we can use them for information retrieval from the index. This will allow us to, for example, index millions of images and quickly and efficiently lookup the most relevant images.

I would love to hear yours and others thoughts on this. I think there is a great opportunity here, but it would need a lot of input and guidance from the experts here.

Thank you,

Pedram

From: David Smiley <da...@gmail.com>>
Sent: Friday, March 1, 2019 12:11 PM
To: dev@lucene.apache.org<ma...@lucene.apache.org>
Cc: Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>>; Arun Sacheti <ar...@bing.com>>; Kun Wu <Wu...@microsoft.com>>; Junhua Wang <Ju...@microsoft.com>>; Jason Li <ja...@microsoft.com>>; René Kriegler <po...@rene-kriegler.com>>
Subject: Re: Vector based store and ANN

This presentation by Rene Kriegler at Haystack 2018 was a real eye-opener to me on this subject: https://haystackconf.com/2018/relevance-scoring/<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhaystackconf.com%2F2018%2Frelevance-scoring%2F&data=02%7C01%7Cpedramr%40microsoft.com%7Cf29cd53595ce40d159cb08d69ea53d85%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870829889983376&sdata=DRkFRcsr6aFOheaKEOMux3VSklNvOjhvxhwZ3g%2Bc%2FK0%3D&reserved=0>. Uses random-projection forests which is a very clever technique.  (CC'ing Rene)

~ David
On Fri, Mar 1, 2019 at 1:30 PM Pedram Rezaei <pe...@microsoft.com.invalid>> wrote:
Hi there,

Thank you for the responses. Yes, we have a few scenarios in mind that can benefit from a vector-based index optimized for ANN searches:


  *   Advanced, optimized, and high precision visual search: For this to work, we would convert the images to their vector representations and then use algorithms and implementations such as SPTAG<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FMicrosoft%2FSPTAG&data=02%7C01%7Cpedramr%40microsoft.com%7Cf29cd53595ce40d159cb08d69ea53d85%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870829889993385&sdata=2H2VZGbc5zpru9FwJ%2BVZxPd%2Bb%2BgPY7d7qoNISONJx6c%3D&reserved=0>, FAISS<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffacebookresearch%2Ffaiss&data=02%7C01%7Cpedramr%40microsoft.com%7Cf29cd53595ce40d159cb08d69ea53d85%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870829889993385&sdata=CKAwCKTV5gjPwML%2FD%2B6TeCqBR8R67bkWpoX%2BSY2jfH0%3D&reserved=0>, and HNSWLIB<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnmslib%2Fhnswlib&data=02%7C01%7Cpedramr%40microsoft.com%7Cf29cd53595ce40d159cb08d69ea53d85%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870829890003390&sdata=5qDUkuLd1bAEO6KsqbuhlHQy8n7V8OkjA%2BQZGi3Mjc4%3D&reserved=0>.
  *   Advanced document retrieval: Using a numerical vector representation of a document, we could improve the search result
  *   Nearest neighbor queries: discovering the nearest neighbors to a given query could also benefit from these ANN algorithms (although doesn’t necessarily need the vector based index)

I would be grateful to hear your thoughts and whether the community is open to a conversation on this topic with my team.

Thanks,

Pedram

From: J. Delgado <jo...@gmail.com>>
Sent: Thursday, February 28, 2019 7:38 AM
To: dev@lucene.apache.org<ma...@lucene.apache.org>
Cc: Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>>
Subject: Re: Vector based store and ANN

Lucene’s scoring function (which I believe is okapi BM25
https://en.m.wikipedia.org/wiki/Okapi_BM25<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fen.m.wikipedia.org%2Fwiki%2FOkapi_BM25&data=02%7C01%7Cpedramr%40microsoft.com%7Cf29cd53595ce40d159cb08d69ea53d85%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870829890003390&sdata=IyKcS6Jp97UumA2BdvroSjTsL2fs4qTbfwjCxUhNKfE%3D&reserved=0>) is a kind of nearest neighbor using the TF-IDF vector representation of documents and query. Are you interested in ANN to be applied to a different kind of vector representation, say for example Doc2Vec?

On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand <jp...@gmail.com>> wrote:
Hi Pedram,

We don't have much in this area, but I'm hearing increasing interest
so it'd be nice to get better there! The closest that we have is this
class that can search for nearest neighbors for a vector of up to 8
dimensions: https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Flucene-solr%2Fblob%2Fmaster%2Flucene%2Fsandbox%2Fsrc%2Fjava%2Forg%2Fapache%2Flucene%2Fdocument%2FFloatPointNearestNeighbor.java&data=02%7C01%7Cpedramr%40microsoft.com%7Cf29cd53595ce40d159cb08d69ea53d85%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870829890013399&sdata=%2BJ9gQGIkOiYGpuCKqvJlMxik5WFMbnh%2F2XtXesuh5pc%3D&reserved=0>.

On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
<pe...@microsoft.com.invalid>> wrote:
>
> Hi there,
>
>
>
> Is there a way to store numerical vectors (vector based index) and perform search based on Approximate Nearest Neighbor class of algorithms in Lucene?
>
>
>
> If not, has there been any interests in the topic so far?
>
>
>
> Thanks,
>
>
>
> Pedram



--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
--
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com


Re: Vector based store and ANN

Posted by René Kriegler <rk...@rene-kriegler.de>.
Thanks for the links, Joaquin!

Yet another thought related to an implementation at the Lucene level: I wonder how much sense it makes to try to implement a one-approach-fits-all solution for vector-based retrieval. We have different expectations of a solution, depending on aspects such as vector dimensionality, domain (text vs. image recognition vs. …) and retrieval quality priorities (recall vs. precision). I think that was also reflected in the Slack discussion. It would be very helpful to have real-life vector datasets (labelled for specific retrieval tasks), so that we could benchmark solutions for retrieval speed and quality metrics. For example, we could easily create synthetic vector datasets for KNN search (which is still a good starting point!), but random vectors probably don’t reflect the distribution we would normally face in an image search or when searching by word embeddings.
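
To make the benchmarking idea a bit more concrete, here is a minimal sketch (not Lucene code) of the two pieces such a benchmark would need: an exact brute-force KNN search to produce ground truth, and a recall metric to compare an approximate result against it. All function names are made up for illustration, and the uniform random vectors are exactly the kind of synthetic data that may not reflect real image or word-embedding distributions.

```python
import heapq
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(query, vectors, k):
    """Brute-force k nearest neighbours; returns vector ids ordered by distance."""
    return heapq.nsmallest(k, range(len(vectors)),
                           key=lambda i: euclidean(query, vectors[i]))


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate result also contains."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def synthetic_dataset(n, dim, seed=0):
    """Uniform random vectors: easy to generate, but not a realistic distribution."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    data = synthetic_dataset(n=1000, dim=16)
    query = data[0]
    truth = exact_knn(query, data, k=10)
    # Stand-in for an approximate search: it only looks at the first half of
    # the data (ids stay comparable because the slice is a prefix).
    approx = exact_knn(query, data[:500], k=10)
    print("recall@10:", recall_at_k(approx, truth))
```

With labelled real-life datasets, the same recall function could be pointed at task-specific relevance judgments instead of exact-KNN ground truth.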

Best,
René


> On 2 Mar 2019, at 22:06, J. Delgado <jo...@gmail.com> wrote:
> 
> Apparently, there is already an implementation along the lines discussed here:
> 
> https://blog.insightdatascience.com/elastik-nearest-neighbors-4b1f6821bd62
> https://github.com/alexklibisz/elastik-nearest-neighbors/
> 
> -- J
> 


Re: Vector based store and ANN

Posted by "J. Delgado" <jo...@gmail.com>.
Apparently, there is already an implementation along the lines discussed
here:

https://blog.insightdatascience.com/elastik-nearest-neighbors-4b1f6821bd62
https://github.com/alexklibisz/elastik-nearest-neighbors/

-- J

On Fri, Mar 1, 2019 at 4:23 PM Pedram Rezaei <pe...@microsoft.com.invalid>
wrote:

> Hi there,
>
>
>
> Thank you for sharing your thoughts. I am finding them extremely useful
> and to be honest exciting!
>
>
>
> Regarding the vector-based scoring, you are 100% correct. There are many
> ways of having an efficient vector-based similarity scorer implemented on
> top of an encoded vector stored at the document level in Lucene.
>
>
>
> As you have rightly pointed out, this in itself might not be sufficient
> for large indexes. After all, the engine would need to read the vector per
> document and then calculate similarity.
>
>
>
> LSH or similar n-pass (n>1) techniques are pretty interesting and
> certainly can get us closer to using the existing index for lookup. As you
> rightly point out below, it can come at a cost either to the performance or
> the precision.
>
>
>
> I am personally very intrigued by the new generation of vector-based
> indexes such as Facebook’s FAISS
> <https://github.com/facebookresearch/faiss> library for similarity search
> and clustering of dense vectors used as part of larger search offerings. Do
> you think there might be a world in which Lucene would want to have
> first-class support for vector-based searches? I think with such a
> capability, we might open the door for new and innovative ways of
> information retrieval.
>
>
>
> I am grateful to you all for your insights and this fascinating discussion!
>
>
>
> Pedram
>
>
>
> P.S. How do I join https://relevancy.slack.com?
>
>
>

RE: Vector based store and ANN

Posted by Pedram Rezaei <pe...@microsoft.com.INVALID>.
Hi there,

Thank you for sharing your thoughts. I am finding them extremely useful and, to be honest, exciting!

Regarding the vector-based scoring, you are 100% correct. There are many ways of having an efficient vector-based similarity scorer implemented on top of an encoded vector stored at the document level in Lucene.

As you have rightly pointed out, this in itself might not be sufficient for large indexes. After all, the engine would need to read the vector per document and then calculate similarity.
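
As a rough illustration of why per-document scoring alone may not scale, the sketch below scores a query against every stored vector with cosine similarity. It is plain Python rather than Lucene code, the names are hypothetical, and it assumes the vectors have already been decoded from whatever per-document encoding is used. The cost is one vector read and one similarity computation per document, so it grows linearly with index size.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_all(query, doc_vectors, k):
    """Score every document against the query and keep the k best.

    doc_vectors maps doc id -> vector. Every vector is read once, so the
    work is proportional to (number of documents) x (vector dimensions).
    """
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in doc_vectors.items())
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    for score, doc_id in score_all([1.0, 0.0], docs, k=2):
        print(doc_id, round(score, 3))
```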

LSH or similar n-pass (n>1) techniques are pretty interesting and certainly can get us closer to using the existing index for lookup. As you rightly point out below, it can come at a cost either to the performance or the precision.

I am personally very intrigued by the new generation of vector-based indexes such as Facebook’s FAISS <https://github.com/facebookresearch/faiss> library for similarity search and clustering of dense vectors used as part of larger search offerings. Do you think there might be a world in which Lucene would want to have first-class support for vector-based searches? I think with such a capability, we might open the door for new and innovative ways of information retrieval.

I am grateful to you all for your insights and this fascinating discussion!

Pedram

P.S. How do I join https://relevancy.slack.com?

From: René Kriegler <rk...@rene-kriegler.de>
Sent: Friday, March 1, 2019 3:24 PM
To: Pedram Rezaei <pe...@microsoft.com>
Cc: dev@lucene.apache.org; Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>; Arun Sacheti <ar...@bing.com>; Kun Wu <Wu...@microsoft.com>; Junhua Wang <Ju...@microsoft.com>; Jason Li <ja...@microsoft.com>
Subject: Re: Vector based store and ANN

Hi there,

Thank you for looping me in. Just a few random thoughts on this topic:

- I’ve heard ;-) that this ES plugin is fast for vector-based scoring: https://github.com/StaySense/fast-cosine-similarity. The links in the ‘General’ section provide some history. As far as I can see, there is nothing which couldn’t be implemented at Lucene level.

- For retrieval, I think a two-pass approach looks like something worth trying out. First pass: look up documents in a low-dimensional space (maybe produced via LSH) and then, in the second pass, calculate vector distances in the high-dimensional space just for the documents that resulted from the first pass. This solution will come with some compromises to make. For example, a higher dimensionality of LSH would increase precision but also produce more hash tokens and make the lookup slower, especially for large indexes.

- Day 2 of Haystack 2019 (https://haystackconf.com/agenda/) will have a talk by Simon Hughes about ‘Search with Vectors’. There is a channel on this topic at OpenSource Connections’ search relevance Slack (https://relevancy.slack.com) and Simon has been one of the drivers of the discussion.
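
The two-pass idea in the second point above could be sketched roughly as follows. This is plain Python, not Lucene code, and it makes several assumptions that are not from the thread: random-hyperplane (sign) hashing as the LSH scheme, splitting the hash bits into bands to form tokens, and the class and parameter names (`TwoPassIndex`, `num_bits`, `band_size`). In a Lucene setting the tokens would be indexed as ordinary terms; here a dictionary of posting sets stands in for the inverted index.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TwoPassIndex:
    """Hypothetical two-pass index: LSH token lookup, then exact re-ranking."""

    def __init__(self, dim, num_bits=16, band_size=4, seed=42):
        rng = random.Random(seed)
        # One random hyperplane per hash bit (assumed LSH scheme).
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.band_size = band_size
        self.postings = defaultdict(set)  # hash token -> doc ids
        self.vectors = {}                 # doc id -> full vector

    def _tokens(self, vec):
        # Each bit records which side of a hyperplane the vector falls on.
        bits = ["1" if sum(p * x for p, x in zip(plane, vec)) >= 0 else "0"
                for plane in self.planes]
        # Group bits into bands; each band becomes one indexable token.
        return [f"{start}:{''.join(bits[start:start + self.band_size])}"
                for start in range(0, len(bits), self.band_size)]

    def add(self, doc_id, vec):
        self.vectors[doc_id] = vec
        for token in self._tokens(vec):
            self.postings[token].add(doc_id)

    def search(self, query, k):
        # First pass: any document sharing at least one token is a candidate.
        candidates = set()
        for token in self._tokens(query):
            candidates |= self.postings.get(token, set())
        # Second pass: exact similarity in the original space, candidates only.
        ranked = sorted(candidates,
                        key=lambda d: cosine(query, self.vectors[d]),
                        reverse=True)
        return ranked[:k]
```

The compromise mentioned above shows up in the two parameters: more bits per band make a token match rarer (fewer, more precise candidates, but lower recall), while more bands produce more tokens per document and a larger first-pass lookup.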

Best,
René



On 1 Mar 2019, at 20:51, Pedram Rezaei <pe...@microsoft.com> wrote:

Thank you for sharing, and it is exciting to see how advanced your thinking is.

Yes, the idea is the same, with an extra step that Rene also seems to allude to here <https://www.slideshare.net/RenKriegler/a-picture-is-worth-a-thousand-words-93680178> in his comment. Instead of using these types of techniques only at scoring time, we can use them for information retrieval from the index. This will allow us to, for example, index millions of images and quickly and efficiently look up the most relevant images.

I would love to hear your and others’ thoughts on this. I think there is a great opportunity here, but it would need a lot of input and guidance from the experts here.

Thank you,

Pedram

From: David Smiley <da...@gmail.com>
Sent: Friday, March 1, 2019 12:11 PM
To: dev@lucene.apache.org
Cc: Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>; Arun Sacheti <ar...@bing.com>; Kun Wu <Wu...@microsoft.com>; Junhua Wang <Ju...@microsoft.com>; Jason Li <ja...@microsoft.com>; René Kriegler <po...@rene-kriegler.com>
Subject: Re: Vector based store and ANN

This presentation by Rene Kriegler at Haystack 2018 was a real eye-opener to me on this subject: https://haystackconf.com/2018/relevance-scoring/. Uses random-projection forests which is a very clever technique.  (CC'ing Rene)

~ David
On Fri, Mar 1, 2019 at 1:30 PM Pedram Rezaei <pe...@microsoft.com.invalid> wrote:
Hi there,

Thank you for the responses. Yes, we have a few scenarios in mind that can benefit from a vector-based index optimized for ANN searches:


  *   Advanced, optimized, and high precision visual search: For this to work, we would convert the images to their vector representations and then use algorithms and implementations such as SPTAG<https://github.com/Microsoft/SPTAG>, FAISS<https://github.com/facebookresearch/faiss>, and HNSWLIB<https://github.com/nmslib/hnswlib>.
  *   Advanced document retrieval: Using a numerical vector representation of a document, we could improve the search result
  *   Nearest neighbor queries: discovering the nearest neighbors to a given query could also benefit from these ANN algorithms (although doesn’t necessarily need the vector based index)

I would be grateful to hear your thoughts and whether the community is open to a conversation on this topic with my team.

Thanks,

Pedram

From: J. Delgado <jo...@gmail.com>
Sent: Thursday, February 28, 2019 7:38 AM
To: dev@lucene.apache.org
Cc: Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>
Subject: Re: Vector based store and ANN

Lucene’s scoring function (which I believe is okapi BM25
https://en.m.wikipedia.org/wiki/Okapi_BM25) is a kind of nearest neighbor using the TF-IDF vector representation of documents and query. Are you interested in ANN to be applied to a different kind of vector representation, say for example Doc2Vec?

On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand <jp...@gmail.com> wrote:
Hi Pedram,

We don't have much in this area, but I'm hearing increasing interest
so it'd be nice to get better there! The closest that we have is this
class that can search for nearest neighbors for a vector of up to 8
dimensions: https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java.

On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
<pe...@microsoft.com.invalid> wrote:
>
> Hi there,
>
>
>
> Is there a way to store numerical vectors (vector based index) and perform search based on Approximate Nearest Neighbor class of algorithms in Lucene?
>
>
>
> If not, has there been any interests in the topic so far?
>
>
>
> Thanks,
>
>
>
> Pedram



--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
--
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com


Re: Vector based store and ANN

Posted by Doug Turnbull <dt...@opensourceconnections.com>.
I'll add that Elasticsearch has vector scoring (though not
filtering/matching) coming into mainline, contributed by Mayya Sharipova:

https://github.com/elastic/elasticsearch/pull/33022

It uses doc values to do some reranking using standard similarities. It's a
start, and hopefully something that can be built upon.

I'm hoping Mayya can be at Haystack... vector filtering/similarities/use cases
could even be its own breakout/collaboration session.
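
To make the reranking idea concrete, here is a minimal sketch, assuming each document's vector can be read back per document (much like a doc value) and that a first query has already produced a candidate list. The names `cosine`, `rerank`, and `doc_vectors` are illustrative only; this is not the code from the pull request above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(candidate_ids, doc_vectors, query_vector, top_k=10):
    """Re-score already matched candidates by similarity to the query vector.

    doc_vectors is a hypothetical per-document lookup (doc id -> vector),
    standing in for whatever per-document storage the engine provides.
    """
    scored = [(doc_id, cosine(doc_vectors[doc_id], query_vector))
              for doc_id in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Usage: three candidates from a first-pass query, reranked against a query vector.
doc_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(rerank(["a", "b", "c"], doc_vectors, [1.0, 0.0], top_k=2))
```

Note that this only scores documents that something else has already matched, which is the "scoring but not filtering/matching" limitation mentioned above.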

On Fri, Mar 1, 2019 at 8:59 PM René Kriegler <rk...@rene-kriegler.de> wrote:

> Hi there,
>
> Thank you for looping me in. Just a few random thoughts on this topic:
>
> - I’ve heard ;-) that this ES plugin is fast for vector-based scoring:
> https://github.com/StaySense/fast-cosine-similarity. The links in the
> ‘General’ section provide some history. As far as I can see, there is
> nothing which couldn’t be implemented at Lucene level.
>
> - For retrieval, I think a two-pass approach looks like something worth
> trying out. First pass: look up documents in a low dimensional space (maybe
> produced via LSH) and then, in the second pass, calculate vector distances
> in the high-dimensional space just for the documents that resulted from the
> first pass. This solution will come with some compromises to make. For
> example, a higher dimensionality of LSH would increase precision but also
> produce more hash tokens and make the lookup slower, especially for large
> indexes.
>
> - Day 2 of Haystack 2019 (https://haystackconf.com/agenda/) will have a
> talk by Simon Hughes about ’Search with Vectors’. There is a channel on
> this topic at OpenSource Connections’ search relevance Slack (
> https://relevancy.slack.com) and Simon has been one of the drivers of the
> discussion.
>
> Best,
> René
>
>
> On 1 Mar 2019, at 20:51, Pedram Rezaei <pe...@microsoft.com> wrote:
>
> Thank you for sharing, and it is exciting to see how advanced your
> thinking is.
>
> Yes, the idea is the same idea with an extra step that Rene also seems to
> allude to here
> <https://www.slideshare.net/RenKriegler/a-picture-is-worth-a-thousand-words-93680178>
>  in his comment. Instead of using these types of techniques only at the
> scoring time, we can use them for information retrieval from the index.
> This will allow us to, for example, index millions of images and quickly
> and efficiently lookup the most relevant images.
>
> I would love to hear yours and others thoughts on this. I think there is a
> great opportunity here, but it would need a lot of input and guidance from
> the experts here.
>
> Thank you,
>
> Pedram
>
> *From:* David Smiley <da...@gmail.com>
> *Sent:* Friday, March 1, 2019 12:11 PM
> *To:* dev@lucene.apache.org
> *Cc:* Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>; Arun
> Sacheti <ar...@bing.com>; Kun Wu <Wu...@microsoft.com>; Junhua Wang <
> Junhua.Wang@microsoft.com>; Jason Li <ja...@microsoft.com>; René Kriegler
> <po...@rene-kriegler.com>
> *Subject:* Re: Vector based store and ANN
>
> This presentation by Rene Kriegler at Haystack 2018 was a real eye-opener
> to me on this subject: https://haystackconf.com/2018/relevance-scoring/. Uses
> random-projection forests which is a very clever technique.  (CC'ing Rene)
>
>
> ~ David
> On Fri, Mar 1, 2019 at 1:30 PM Pedram Rezaei <
> pedramr@microsoft.com.invalid> wrote:
>
> Hi there,
>
> Thank you for the responses. Yes, we have a few scenarios in mind that can
> benefit from a vector-based index optimized for ANN searches:
>
>
>    - Advanced, optimized, and high precision visual search: For this to
>    work, we would convert the images to their vector representations and then
>    use algorithms and implementations such as SPTAG
>    <https://github.com/Microsoft/SPTAG>, FAISS
>    <https://github.com/facebookresearch/faiss>, and HNSWLIB
>    <https://github.com/nmslib/hnswlib>.
>    - Advanced document retrieval: Using a numerical vector representation
>    of a document, we could improve the search result
>    - Nearest neighbor queries: discovering the nearest neighbors to a
>    given query could also benefit from these ANN algorithms (although doesn’t
>    necessarily need the vector based index)
>
>
> I would be grateful to hear your thoughts and whether the community is
> open to a conversation on this topic with my team.
>
> Thanks,
>
> Pedram
>
> *From:* J. Delgado <jo...@gmail.com>
> *Sent:* Thursday, February 28, 2019 7:38 AM
> *To:* dev@lucene.apache.org
> *Cc:* Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>
> *Subject:* Re: Vector based store and ANN
>
> Lucene’s scoring function (which I believe is okapi BM25
> https://en.m.wikipedia.org/wiki/Okapi_BM25)
> is a kind of nearest neighbor using the TF-IDF vector representation of
> documents and query. Are you interested in ANN to be applied to a different
> kind of vector representation, say for example Doc2Vec?
>
> On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand <jp...@gmail.com> wrote:
>
> Hi Pedram,
>
> We don't have much in this area, but I'm hearing increasing interest
> so it'd be nice to get better there! The closest that we have is this
> class that can search for nearest neighbors for a vector of up to 8
> dimensions:
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java
> .
>
> On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
> <pe...@microsoft.com.invalid> wrote:
> >
> > Hi there,
> >
> >
> >
> > Is there a way to store numerical vectors (vector based index) and
> perform search based on Approximate Nearest Neighbor class of algorithms in
> Lucene?
> >
> >
> >
> > If not, has there been any interests in the topic so far?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Pedram
>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
> --
> Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley
>  | Book: http://www.solrenterprisesearchserver.com
>
>
>

-- 
*Doug Turnbull **| CTO* | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.

Re: Vector based store and ANN

Posted by René Kriegler <rk...@rene-kriegler.de>.
Hi there,

Thank you for looping me in. Just a few random thoughts on this topic: 

- I’ve heard ;-) that this ES plugin is fast for vector-based scoring: https://github.com/StaySense/fast-cosine-similarity. The links in the ‘General’ section provide some history. As far as I can see, there is nothing which couldn’t be implemented at the Lucene level.

- For retrieval, I think a two-pass approach looks worth trying out. First pass: look up documents in a low-dimensional space (maybe produced via LSH). Second pass: calculate vector distances in the high-dimensional space just for the documents returned by the first pass. This solution comes with trade-offs. For example, a higher dimensionality of LSH would increase precision but also produce more hash tokens and make the lookup slower, especially for large indexes.

- Day 2 of Haystack 2019 (https://haystackconf.com/agenda/) will have a talk by Simon Hughes about ’Search with Vectors’. There is a channel on this topic at OpenSource Connections’ search relevance Slack (https://relevancy.slack.com) and Simon has been one of the drivers of the discussion.
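
The two-pass idea in the second point can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses random-hyperplane LSH (one common LSH family for cosine similarity) and a plain dictionary standing in for an inverted index of hash tokens. All function names and parameters (`make_hyperplanes`, `bits_per_table`, and so on) are hypothetical, not Lucene APIs.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(num_tables, bits_per_table, dim, seed=0):
    """Draw random Gaussian hyperplanes; each table yields one hash token per vector."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(bits_per_table)]
            for _ in range(num_tables)]


def hash_tokens(vector, hyperplanes):
    """One token per table: the sign pattern of the vector against that table's planes."""
    tokens = []
    for table_id, planes in enumerate(hyperplanes):
        bits = "".join(
            "1" if sum(p * v for p, v in zip(plane, vector)) >= 0 else "0"
            for plane in planes)
        tokens.append(f"t{table_id}_{bits}")
    return tokens


def build_index(doc_vectors, hyperplanes):
    """Posting lists: hash token -> set of doc ids, like terms in an inverted index."""
    postings = defaultdict(set)
    for doc_id, vector in doc_vectors.items():
        for token in hash_tokens(vector, hyperplanes):
            postings[token].add(doc_id)
    return postings


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, doc_vectors, postings, hyperplanes, top_k=3):
    # First pass: cheap lookup of every document sharing at least one hash token.
    candidates = set()
    for token in hash_tokens(query, hyperplanes):
        candidates |= postings.get(token, set())
    # Second pass: exact similarity in the full space, only for the candidates.
    scored = sorted(((cosine(doc_vectors[d], query), d) for d in candidates),
                    reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]
```

In this sketch, raising `bits_per_table` makes each token more selective, while raising `num_tables` produces more tokens per document and per query, which is the kind of precision versus lookup-cost trade-off described above.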

Best,
René


> On 1 Mar 2019, at 20:51, Pedram Rezaei <pe...@microsoft.com> wrote:
> 
> Thank you for sharing, and it is exciting to see how advanced your thinking is.
>  
> Yes, the idea is the same idea with an extra step that Rene also seems to allude to here <https://www.slideshare.net/RenKriegler/a-picture-is-worth-a-thousand-words-93680178> in his comment. Instead of using these types of techniques only at scoring time, we can use them for information retrieval from the index. This will allow us to, for example, index millions of images and quickly and efficiently look up the most relevant images.
>  
> I would love to hear your and others' thoughts on this. I think there is a great opportunity here, but it would need a lot of input and guidance from the experts here.
>  
> Thank you,
>  
> Pedram
>  
> From: David Smiley <da...@gmail.com> 
> Sent: Friday, March 1, 2019 12:11 PM
> To: dev@lucene.apache.org
> Cc: Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>; Arun Sacheti <ar...@bing.com>; Kun Wu <Wu...@microsoft.com>; Junhua Wang <Ju...@microsoft.com>; Jason Li <ja...@microsoft.com>; René Kriegler <po...@rene-kriegler.com>
> Subject: Re: Vector based store and ANN
>  
> This presentation by Rene Kriegler at Haystack 2018 was a real eye-opener to me on this subject: https://haystackconf.com/2018/relevance-scoring/. Uses random-projection forests which is a very clever technique.  (CC'ing Rene)
>  
> ~ David
> 
> On Fri, Mar 1, 2019 at 1:30 PM Pedram Rezaei <pedramr@microsoft.com.invalid> wrote:
> Hi there,
>  
> Thank you for the responses. Yes, we have a few scenarios in mind that can benefit from a vector-based index optimized for ANN searches:
>  
> - Advanced, optimized, and high precision visual search: For this to work, we would convert the images to their vector representations and then use algorithms and implementations such as SPTAG <https://github.com/Microsoft/SPTAG>, FAISS <https://github.com/facebookresearch/faiss>, and HNSWLIB <https://github.com/nmslib/hnswlib>.
> - Advanced document retrieval: Using a numerical vector representation of a document, we could improve the search result
> - Nearest neighbor queries: discovering the nearest neighbors to a given query could also benefit from these ANN algorithms (although doesn’t necessarily need the vector based index)
>  
> I would be grateful to hear your thoughts and whether the community is open to a conversation on this topic with my team.
>  
> Thanks,
>  
> Pedram
>  
> From: J. Delgado <joaquin.delgado@gmail.com>
> Sent: Thursday, February 28, 2019 7:38 AM
> To: dev@lucene.apache.org
> Cc: Radhakrishnan Srikanth (SRIKANTH) <rsrikan@microsoft.com>
> Subject: Re: Vector based store and ANN
>  
> Lucene’s scoring function (which I believe is okapi BM25  
> https://en.m.wikipedia.org/wiki/Okapi_BM25) is a kind of nearest neighbor using the TF-IDF vector representation of documents and query. Are you interested in ANN to be applied to a different kind of vector representation, say for example Doc2Vec?
>  
> On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand <jpountz@gmail.com> wrote:
> Hi Pedram,
> 
> We don't have much in this area, but I'm hearing increasing interest
> so it'd be nice to get better there! The closest that we have is this
> class that can search for nearest neighbors for a vector of up to 8
> dimensions: https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java.
> 
> On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
> <pedramr@microsoft.com.invalid> wrote:
> >
> > Hi there,
> >
> >
> >
> > Is there a way to store numerical vectors (vector based index) and perform search based on Approximate Nearest Neighbor class of algorithms in Lucene?
> >
> >
> >
> > If not, has there been any interests in the topic so far?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Pedram
> 
> 
> 
> -- 
> Adrien
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> -- 
> Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com

Re: Vector based store and ANN

Posted by "J. Delgado" <jo...@gmail.com>.
Traditional search engines work both as a retrieval engine, with
support for arbitrarily complex Boolean queries, and as a scoring engine that
performs vector-based similarity computations. This works very well for words
(terms) because of the clever inverted index and posting list data
structures, which represent a very sparse matrix that associates
terms/weights with documents.  I'm not so sure whether these core properties of
a search engine can be generalized to performing the selection with an ANN
algorithm such as LSH and then applying a more sophisticated scoring function.
Notice that nearest neighbor search is inherently a top-k selection.  As
stated in Rene's presentation, it can work with image recognition vectors
(embeddings) by implementing a Random Projection Forest, indexing random
projections and defining hyperplanes instead of the full high-dimensional
vector, which is an interesting approach. It reminds me of the use of
Geohash and isochrones in DoorDash's search (see
https://medium.com/@DoorDash/how-we-designed-road-distances-in-doordash-search-913ef8434099
)
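
As a rough illustration of the inverted index point above, the sketch below stores a sparse term/weight matrix as posting lists and scores only the documents that share a term with the query, ending with a top-k selection. It assumes simple precomputed weights rather than any real scoring formula such as BM25, and the names `build_postings` and `top_k` are made up for this example.

```python
import heapq
from collections import defaultdict


def build_postings(doc_term_weights):
    """Invert doc -> {term: weight} into term -> {doc: weight} posting lists."""
    postings = defaultdict(dict)
    for doc_id, weights in doc_term_weights.items():
        for term, weight in weights.items():
            postings[term][doc_id] = weight
    return postings


def top_k(query_weights, postings, k=2):
    """Sparse dot product: only documents on a query term's posting list are touched."""
    scores = defaultdict(float)
    for term, query_weight in query_weights.items():
        for doc_id, doc_weight in postings.get(term, {}).items():
            scores[doc_id] += query_weight * doc_weight
    # Selecting the k best scores is the top-k step nearest neighbor search implies.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


docs = {
    "d1": {"vector": 2.0, "search": 1.0},
    "d2": {"search": 0.5},
    "d3": {"image": 1.0},
}
postings = build_postings(docs)
print(top_k({"vector": 1.0, "search": 1.0}, postings, k=2))
```

The sparsity is what makes this cheap: "d3" is never visited because it shares no term with the query. Dense embeddings have no such zero entries to skip, which is one way to read the doubt expressed above.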

I've been working on ML scoring within search (traditional ML/Learning to
Rank and, more recently, Deep Learning), which has worked well in my previous
lives and now at Groupon. See the various presentations I have given on the
topic since 2015:

https://www.youtube.com/watch?v=x-tLA8eZs1k
https://www.slideshare.net/SDianaHu/recsys-2015-tutorial-scalable-recommender-systems-where-machine-learning-meets-search
https://www.slideshare.net/bojanbabic/deep-learning-application-within-search-and-ranking-at-groupon



Thanks!

-- J

On Fri, Mar 1, 2019 at 12:58 PM Pedram Rezaei <pe...@microsoft.com.invalid>
wrote:

> Thank you for sharing, and it is exciting to see how advanced your
> thinking is.
>
>
>
> Yes, the idea is the same idea with an extra step that Rene also seems to
> allude to here
> <https://www.slideshare.net/RenKriegler/a-picture-is-worth-a-thousand-words-93680178>
> in his comment. Instead of using these types of techniques only at the
> scoring time, we can use them for information retrieval from the index.
> This will allow us to, for example, index millions of images and quickly
> and efficiently lookup the most relevant images.
>
>
>
> I would love to hear yours and others thoughts on this. I think there is a
> great opportunity here, but it would need a lot of input and guidance from
> the experts here.
>
>
>
> Thank you,
>
>
>
> Pedram
>
>
>
> *From:* David Smiley <da...@gmail.com>
> *Sent:* Friday, March 1, 2019 12:11 PM
> *To:* dev@lucene.apache.org
> *Cc:* Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>; Arun
> Sacheti <ar...@bing.com>; Kun Wu <Wu...@microsoft.com>; Junhua Wang <
> Junhua.Wang@microsoft.com>; Jason Li <ja...@microsoft.com>; René Kriegler
> <po...@rene-kriegler.com>
> *Subject:* Re: Vector based store and ANN
>
>
>
> This presentation by Rene Kriegler at Haystack 2018 was a real eye-opener
> to me on this subject: https://haystackconf.com/2018/relevance-scoring/. Uses
> random-projection forests which is a very clever technique.  (CC'ing Rene)
>
>
>
> ~ David
>
> On Fri, Mar 1, 2019 at 1:30 PM Pedram Rezaei <
> pedramr@microsoft.com.invalid> wrote:
>
> Hi there,
>
>
>
> Thank you for the responses. Yes, we have a few scenarios in mind that can
> benefit from a vector-based index optimized for ANN searches:
>
>
>
>    - Advanced, optimized, and high precision visual search: For this to
>    work, we would convert the images to their vector representations and then
>    use algorithms and implementations such as SPTAG
>    <https://github.com/Microsoft/SPTAG>,
>    FAISS
>    <https://github.com/facebookresearch/faiss>,
>    and HNSWLIB
>    <https://github.com/nmslib/hnswlib>.
>    - Advanced document retrieval: Using a numerical vector representation
>    of a document, we could improve the search result
>    - Nearest neighbor queries: discovering the nearest neighbors to a
>    given query could also benefit from these ANN algorithms (although doesn’t
>    necessarily need the vector based index)
>
>
>
> I would be grateful to hear your thoughts and whether the community is
> open to a conversation on this topic with my team.
>
>
>
> Thanks,
>
>
>
> Pedram
>
>
>
> *From:* J. Delgado <jo...@gmail.com>
> *Sent:* Thursday, February 28, 2019 7:38 AM
> *To:* dev@lucene.apache.org
> *Cc:* Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>
> *Subject:* Re: Vector based store and ANN
>
>
>
> Lucene’s scoring function (which I believe is okapi BM25
>
> https://en.m.wikipedia.org/wiki/Okapi_BM25)
> is a kind of nearest neighbor using the TF-IDF vector representation of
> documents and query. Are you interested in ANN to be applied to a different
> kind of vector representation, say for example Doc2Vec?
>
>
>
> On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand <jp...@gmail.com> wrote:
>
> Hi Pedram,
>
> We don't have much in this area, but I'm hearing increasing interest
> so it'd be nice to get better there! The closest that we have is this
> class that can search for nearest neighbors for a vector of up to 8
> dimensions:
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java
> .
>
> On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
> <pe...@microsoft.com.invalid> wrote:
> >
> > Hi there,
> >
> >
> >
> > Is there a way to store numerical vectors (vector based index) and
> perform search based on Approximate Nearest Neighbor class of algorithms in
> Lucene?
> >
> >
> >
> > If not, has there been any interests in the topic so far?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Pedram
>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
> --
>
> Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
>
> LinkedIn: http://linkedin.com/in/davidwsmiley
> | Book: http://www.solrenterprisesearchserver.com
>

RE: Vector based store and ANN

Posted by Pedram Rezaei <pe...@microsoft.com.INVALID>.
Thank you for sharing, and it is exciting to see how advanced your thinking is.

Yes, the idea is the same idea with an extra step that Rene also seems to elude to here<https://www.slideshare.net/RenKriegler/a-picture-is-worth-a-thousand-words-93680178> in his comment. Instead of using these types of techniques only at the scoring time, we can use them for information retrieval from the index. This will allow us to, for example, index millions of images and quickly and efficiently lookup the most relevant images.

I would love to hear your thoughts, and those of others, on this. I think there is a great opportunity here, but it would need a lot of input and guidance from the experts here.

Thank you,

Pedram

From: David Smiley <da...@gmail.com>
Sent: Friday, March 1, 2019 12:11 PM
To: dev@lucene.apache.org
Cc: Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>; Arun Sacheti <ar...@bing.com>; Kun Wu <Wu...@microsoft.com>; Junhua Wang <Ju...@microsoft.com>; Jason Li <ja...@microsoft.com>; René Kriegler <po...@rene-kriegler.com>
Subject: Re: Vector based store and ANN

This presentation by Rene Kriegler at Haystack 2018 was a real eye-opener to me on this subject: https://haystackconf.com/2018/relevance-scoring/. Uses random-projection forests which is a very clever technique.  (CC'ing Rene)

~ David
On Fri, Mar 1, 2019 at 1:30 PM Pedram Rezaei <pe...@microsoft.com.invalid> wrote:
Hi there,

Thank you for the responses. Yes, we have a few scenarios in mind that can benefit from a vector-based index optimized for ANN searches:


  *   Advanced, optimized, and high precision visual search: For this to work, we would convert the images to their vector representations and then use algorithms and implementations such as SPTAG<https://github.com/Microsoft/SPTAG>, FAISS<https://github.com/facebookresearch/faiss>, and HNSWLIB<https://github.com/nmslib/hnswlib>.
  *   Advanced document retrieval: Using a numerical vector representation of a document, we could improve the search result
  *   Nearest neighbor queries: discovering the nearest neighbors to a given query could also benefit from these ANN algorithms (although doesn’t necessarily need the vector based index)

I would be grateful to hear your thoughts and whether the community is open to a conversation on this topic with my team.

Thanks,

Pedram

From: J. Delgado <jo...@gmail.com>
Sent: Thursday, February 28, 2019 7:38 AM
To: dev@lucene.apache.org
Cc: Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>
Subject: Re: Vector based store and ANN

Lucene’s scoring function (which I believe is okapi BM25
https://en.m.wikipedia.org/wiki/Okapi_BM25) is a kind of nearest neighbor using the TF-IDF vector representation of documents and query. Are you interested in ANN to be applied to a different kind of vector representation, say for example Doc2Vec?

On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand <jp...@gmail.com> wrote:
Hi Pedram,

We don't have much in this area, but I'm hearing increasing interest
so it'd be nice to get better there! The closest that we have is this
class that can search for nearest neighbors for a vector of up to 8
dimensions: https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java.

On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
<pe...@microsoft.com.invalid> wrote:
>
> Hi there,
>
>
>
> Is there a way to store numerical vectors (vector based index) and perform search based on Approximate Nearest Neighbor class of algorithms in Lucene?
>
>
>
> If not, has there been any interests in the topic so far?
>
>
>
> Thanks,
>
>
>
> Pedram



--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
--
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com

Re: Vector based store and ANN

Posted by David Smiley <da...@gmail.com>.
This presentation by Rene Kriegler at Haystack 2018 was a real eye-opener
to me on this subject: https://haystackconf.com/2018/relevance-scoring/. It
uses random-projection forests, which is a very clever technique.  (CC'ing Rene)
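
For readers who have not seen the technique, the following is a minimal sketch of a random-projection forest in Python. It is not taken from the presentation and is not Lucene code; the function names (build_tree, candidates, search) and the leaf_size parameter are hypothetical, chosen only for illustration. Each tree splits the points at the median of their projection onto a random direction, and a query collects the leaf it falls into from every tree before an exact re-ranking of that small candidate pool.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_tree(points, ids, rng, leaf_size=4):
    """Recursively split ids at the median of a random projection."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    direction = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    projections = sorted(dot(points[i], direction) for i in ids)
    threshold = projections[len(projections) // 2]
    left = [i for i in ids if dot(points[i], direction) < threshold]
    right = [i for i in ids if dot(points[i], direction) >= threshold]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": ids}
    return {
        "direction": direction,
        "threshold": threshold,
        "left": build_tree(points, left, rng, leaf_size),
        "right": build_tree(points, right, rng, leaf_size),
    }


def candidates(tree, query):
    """Walk one tree down to the leaf the query falls into."""
    node = tree
    while "leaf" not in node:
        go_left = dot(query, node["direction"]) < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def search(forest, points, query, k):
    """Union the leaves from all trees, then re-rank them exactly."""
    pool = set()
    for tree in forest:
        pool.update(candidates(tree, query))

    def squared_distance(i):
        return sum((p - q) ** 2 for p, q in zip(points[i], query))

    return sorted(pool, key=squared_distance)[:k]


# Usage: 200 random 16-dimensional vectors, 5 trees.
rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
forest = [build_tree(points, list(range(len(points))), rng) for _ in range(5)]
top = search(forest, points, points[17], 1)
```

More trees raise recall at the cost of a larger candidate pool, which is the usual performance versus precision trade-off with this family of methods.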

~ David

On Fri, Mar 1, 2019 at 1:30 PM Pedram Rezaei <pe...@microsoft.com.invalid>
wrote:

> Hi there,
>
>
>
> Thank you for the responses. Yes, we have a few scenarios in mind that can
> benefit from a vector-based index optimized for ANN searches:
>
>
>
>    - Advanced, optimized, and high precision visual search: For this to
>    work, we would convert the images to their vector representations and then
>    use algorithms and implementations such as SPTAG
>    <https://github.com/Microsoft/SPTAG>, FAISS
>    <https://github.com/facebookresearch/faiss>, and HNSWLIB
>    <https://github.com/nmslib/hnswlib>.
>    - Advanced document retrieval: Using a numerical vector representation
>    of a document, we could improve the search result
>    - Nearest neighbor queries: discovering the nearest neighbors to a
>    given query could also benefit from these ANN algorithms (although doesn’t
>    necessarily need the vector based index)
>
>
>
> I would be grateful to hear your thoughts and whether the community is
> open to a conversation on this topic with my team.
>
>
>
> Thanks,
>
>
>
> Pedram
>
>
>
> *From:* J. Delgado <jo...@gmail.com>
> *Sent:* Thursday, February 28, 2019 7:38 AM
> *To:* dev@lucene.apache.org
> *Cc:* Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>
> *Subject:* Re: Vector based store and ANN
>
>
>
> Lucene’s scoring function (which I believe is okapi BM25
>
> https://en.m.wikipedia.org/wiki/Okapi_BM25)
> is a kind of nearest neighbor using the TF-IDF vector representation of
> documents and query. Are you interested in ANN to be applied to a different
> kind of vector representation, say for example Doc2Vec?
>
>
>
> On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand <jp...@gmail.com> wrote:
>
> Hi Pedram,
>
> We don't have much in this area, but I'm hearing increasing interest
> so it'd be nice to get better there! The closest that we have is this
> class that can search for nearest neighbors for a vector of up to 8
> dimensions:
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java
> .
>
> On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
> <pe...@microsoft.com.invalid> wrote:
> >
> > Hi there,
> >
> >
> >
> > Is there a way to store numerical vectors (vector based index) and
> perform search based on Approximate Nearest Neighbor class of algorithms in
> Lucene?
> >
> >
> >
> > If not, has there been any interests in the topic so far?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Pedram
>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
> --
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

RE: Vector based store and ANN

Posted by Pedram Rezaei <pe...@microsoft.com.INVALID>.
Hi there,

Thank you for the responses. Yes, we have a few scenarios in mind that can benefit from a vector-based index optimized for ANN searches:


  *   Advanced, optimized, and high precision visual search: For this to work, we would convert the images to their vector representations and then use algorithms and implementations such as SPTAG<https://github.com/Microsoft/SPTAG>, FAISS<https://github.com/facebookresearch/faiss>, and HNSWLIB<https://github.com/nmslib/hnswlib>.
  *   Advanced document retrieval: Using a numerical vector representation of a document, we could improve the search results.
  *   Nearest neighbor queries: discovering the nearest neighbors to a given query could also benefit from these ANN algorithms (although this doesn’t necessarily need the vector-based index).

I would be grateful to hear your thoughts and whether the community is open to a conversation on this topic with my team.

Thanks,

Pedram

From: J. Delgado <jo...@gmail.com>
Sent: Thursday, February 28, 2019 7:38 AM
To: dev@lucene.apache.org
Cc: Radhakrishnan Srikanth (SRIKANTH) <rs...@microsoft.com>
Subject: Re: Vector based store and ANN

Lucene’s scoring function (which I believe is okapi BM25
https://en.m.wikipedia.org/wiki/Okapi_BM25) is a kind of nearest neighbor using the TF-IDF vector representation of documents and query. Are you interested in ANN to be applied to a different kind of vector representation, say for example Doc2Vec?

On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand <jp...@gmail.com> wrote:
Hi Pedram,

We don't have much in this area, but I'm hearing increasing interest
so it'd be nice to get better there! The closest that we have is this
class that can search for nearest neighbors for a vector of up to 8
dimensions: https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java.

On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
<pe...@microsoft.com.invalid> wrote:
>
> Hi there,
>
>
>
> Is there a way to store numerical vectors (vector based index) and perform search based on Approximate Nearest Neighbor class of algorithms in Lucene?
>
>
>
> If not, has there been any interests in the topic so far?
>
>
>
> Thanks,
>
>
>
> Pedram



--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Vector based store and ANN

Posted by "J. Delgado" <jo...@gmail.com>.
Lucene’s scoring function (which I believe is Okapi BM25
https://en.m.wikipedia.org/wiki/Okapi_BM25) is a kind of nearest neighbor
search using the TF-IDF vector representation of documents and query. Are you
interested in applying ANN to a different kind of vector
representation, say for example Doc2Vec?
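
To illustrate the analogy, here is a small sketch in Python that ranks documents by cosine similarity between sparse TF-IDF vectors. It is a simplification and an assumption on my part: BM25 adds term-frequency saturation and length normalization and is not a plain cosine, and the helper names (tfidf_vectors, cosine, rank) and the particular IDF formula are hypothetical, not Lucene's.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per document, plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(1 + n / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    numerator = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return numerator / norm if norm else 0.0


def rank(query, docs):
    """Return document indexes ordered from most to least similar to the query."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(query.lower().split())
    q = {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}
    return sorted(range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True)


docs = [
    "lucene vector search",
    "cooking pasta recipes",
    "approximate nearest neighbor search in lucene",
]
order = rank("vector search", docs)
```

The difference with dense representations such as Doc2Vec is that these vectors are sparse, so an inverted index can skip every document that shares no term with the query; dense vectors offer no such shortcut, which is where ANN structures come in.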

On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand <jp...@gmail.com> wrote:

> Hi Pedram,
>
> We don't have much in this area, but I'm hearing increasing interest
> so it'd be nice to get better there! The closest that we have is this
> class that can search for nearest neighbors for a vector of up to 8
> dimensions:
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java
> .
>
> On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
> <pe...@microsoft.com.invalid> wrote:
> >
> > Hi there,
> >
> >
> >
> > Is there a way to store numerical vectors (vector based index) and
> perform search based on Approximate Nearest Neighbor class of algorithms in
> Lucene?
> >
> >
> >
> > If not, has there been any interests in the topic so far?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Pedram
>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Vector based store and ANN

Posted by Adrien Grand <jp...@gmail.com>.
Hi Pedram,

We don't have much in this area, but I'm hearing increasing interest
so it'd be nice to get better there! The closest that we have is this
class that can search for nearest neighbors for a vector of up to 8
dimensions: https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java.
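
As a point of reference, the sketch below shows in Python what an exact nearest-neighbor lookup computes for low-dimensional points. It is a plain linear scan written for illustration only: it does not use or reproduce the Lucene class linked above, and the function name nearest and its parameters are hypothetical.

```python
import heapq
import math


def nearest(points, origin, top_n):
    """Return the top_n (distance, doc_id) pairs closest to origin, nearest first."""

    def distance(point):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(point, origin)))

    scored = ((distance(point), doc_id) for doc_id, point in enumerate(points))
    return heapq.nsmallest(top_n, scored)


# Usage: four 2-dimensional points, query at the origin.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
hits = nearest(points, (0.0, 0.0), 2)
```

A scan like this touches every vector on every query, so its cost grows linearly with the index size; approximate methods trade some accuracy to avoid that.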

On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
<pe...@microsoft.com.invalid> wrote:
>
> Hi there,
>
>
>
> Is there a way to store numerical vectors (vector based index) and perform search based on Approximate Nearest Neighbor class of algorithms in Lucene?
>
>
>
> If not, has there been any interests in the topic so far?
>
>
>
> Thanks,
>
>
>
> Pedram



-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org