Posted to dev@solr.apache.org by Joel Bernstein <jo...@gmail.com> on 2023/05/23 15:20:33 UTC

Hybrid scoring lexical / vector

One of the things that I'm focusing on is combining the Solr similarity
score with the vector score in a consistent manner. My main concern is
dealing with the unbounded nature of the Solr similarity score and how to
balance that with a vector score.

So my first question is: are there any mechanisms now to scale or squash the
Solr similarity score before combining it with a vector score?

Below are two ideas I have for squashing / scaling the score:

1) SquashingScoreQuery. This is a wrapper query that squashes the score of
its wrapped query using a sigmoid function.

2) Min/Max scale the main query score in the ReRanker. This simply adds a
flag to the ReRanker to min/max scale the main query scores before
combining with the ReRank query.

Do others have thoughts on this?
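
A minimal sketch of idea 1, for illustration only. It assumes a logistic
sigmoid and a vector score that already lies in [0, 1]. The names squash,
hybrid_score, midpoint, steepness and lexical_weight are hypothetical and are
not part of the proposed SquashingScoreQuery or of any Solr API.

```python
import math


def squash(score: float, midpoint: float = 5.0, steepness: float = 1.0) -> float:
    """Map an unbounded lexical score into [0, 1] with a logistic sigmoid.

    midpoint is the raw score that maps to 0.5; steepness controls how
    quickly the curve saturates. Both are assumed tuning knobs.
    """
    z = steepness * (score - midpoint)
    # Numerically stable form: never calls exp() on a large positive value.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hybrid_score(lexical: float, vector: float, lexical_weight: float = 0.5) -> float:
    """Weighted sum of the squashed lexical score and a vector score in [0, 1]."""
    return lexical_weight * squash(lexical) + (1.0 - lexical_weight) * vector
```

Because the sigmoid is monotonic, squashing does not change the ordering
produced by the lexical score alone; it only bounds the range so that the
weighted sum is not dominated by large raw scores.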

Re: Hybrid scoring lexical / vector

Posted by Joel Bernstein <jo...@gmail.com>.
I'll also add an implementation of Reciprocal Rank Fusion (RRF) to the
ReRanker.

https://weaviate.io/blog/hybrid-search-explained
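
For reference, a small sketch of how RRF fuses two ranked lists. RRF uses only
the rank of each document, not its score, so it sidesteps the unbounded
lexical score. The function name and the constant k=60 (a commonly used
default) are assumptions for illustration, not details of the planned
ReRanker change.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Ties are broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["a", "b", "c"]  # hypothetical lexical ranking
vector = ["b", "c", "d"]   # hypothetical vector ranking
fused = reciprocal_rank_fusion([lexical, vector])
```

Documents that appear in both lists ("b" and "c" here) rise above documents
that appear in only one.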



Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, May 26, 2023 at 4:58 PM Joel Bernstein <jo...@gmail.com> wrote:

> I'm going to create a ticket for adding Min/Max scaling to the ReRanker.
> The ReRanker has access to all the topDocs so it should be pretty
> straightforward to min/max scale all the topDocs before ReRanking the topN.

Re: Hybrid scoring lexical / vector

Posted by Joel Bernstein <jo...@gmail.com>.
I'm going to create a ticket for adding Min/Max scaling to the ReRanker.
The ReRanker has access to all the topDocs so it should be pretty
straightforward to min/max scale all the topDocs before ReRanking the topN.
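
A rough sketch of that flow, under stated assumptions: the main-query scores
of all topDocs are min/max scaled to [0, 1], and then only the first topN are
re-scored by adding a weighted rerank score. The function and parameter names
are hypothetical, and the additive combination is an assumption about how the
scores are merged, not the actual ReRanker code.

```python
def min_max_scale(scores):
    """Scale scores linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case: all scores equal. Mapping to 0.0 is an arbitrary choice.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def rerank(top_docs, rerank_scores, top_n, rerank_weight=1.0):
    """Rerank the first top_n docs after min/max scaling all main scores.

    top_docs: list of (doc_id, main_score), sorted by main_score descending.
    rerank_scores: dict doc_id -> non-negative rerank score (e.g. vector similarity).
    """
    if not top_docs:
        return []
    scaled = min_max_scale([score for _, score in top_docs])
    docs = [(doc_id, s) for (doc_id, _), s in zip(top_docs, scaled)]
    # Only the head of the list is combined with the rerank score.
    head = [
        (doc_id, s + rerank_weight * rerank_scores.get(doc_id, 0.0))
        for doc_id, s in docs[:top_n]
    ]
    head.sort(key=lambda pair: pair[1], reverse=True)
    return head + docs[top_n:]


top_docs = [("d1", 12.0), ("d2", 8.0), ("d3", 4.0)]
result = rerank(top_docs, {"d1": 0.1, "d2": 0.9}, top_n=2)
```

Here the scaled main scores are 1.0, 0.5 and 0.0, so "d2" overtakes "d1" once
its rerank score is added, while "d3" stays outside the reranked window.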


Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, May 25, 2023 at 5:18 AM Alessandro Benedetti <a....@sease.io>
wrote:

> Hi all,
> our approach to providing hybrid search in Solr has been focused on the
> reranking side, specifically enabling vector-based features in Learning To
> Rank.
> In this way, you can combine lexical features (such as the original BM25
> score) with various vector distances (in more than one field if you like)
> and other factors using whatever model is supported (linear, tree-based,
> neural network).
> First-stage hybrid retrieval should already be reasonably well supported
> through the boolean query parser.
>
> We started the work with function queries (which unfortunately are
> scattered across Lucene and Solr; now that the projects are separate
> again, it's a lengthy process).
> Our first step is almost ready:
> https://github.com/apache/lucene/pull/12253
> Any feedback is welcome!
>
> Then, regarding the separate problem of having an unbounded relevance score
> in Lucene/Solr, I agree that it can (and should) be improved. I would love to
> see it as a probabilistic score, but I imagine that making this change in
> Lucene will cause an enormous discussion, probably ending in a standstill.
> You have my support!

Re: Hybrid scoring lexical / vector

Posted by Alessandro Benedetti <a....@sease.io>.
Hi all,
our approach to providing hybrid search in Solr has been focused on the
reranking side, specifically enabling vector-based features in Learning To
Rank.
In this way, you can combine lexical features (such as the original BM25
score) with various vector distances (in more than one field if you like)
and other factors using whatever model is supported (linear, tree-based,
neural network).
First-stage hybrid retrieval should already be reasonably well supported
through the boolean query parser.

We started the work with function queries (which unfortunately are
scattered across Lucene and Solr; now that the projects are separate
again, it's a lengthy process).
Our first step is almost ready:
https://github.com/apache/lucene/pull/12253
Any feedback is welcome!

Then, regarding the separate problem of having an unbounded relevance score
in Lucene/Solr, I agree that it can (and should) be improved. I would love to
see it as a probabilistic score, but I imagine that making this change in
Lucene will cause an enormous discussion, probably ending in a standstill.
You have my support!


--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Tue, 23 May 2023 at 19:17, Mikhail Khludnev <mk...@apache.org> wrote:

> Hello, Joel.
>
> Here's my idea
> https://lists.apache.org/thread/6t45p5fk4hldrt1833kvrbobdd2pk265

Re: Hybrid scoring lexical / vector

Posted by Mikhail Khludnev <mk...@apache.org>.
Hello, Joel.

Here's my idea
https://lists.apache.org/thread/6t45p5fk4hldrt1833kvrbobdd2pk265



-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!