Posted to solr-user@lucene.apache.org by Chris Book <ch...@gmail.com> on 2012/04/11 04:28:53 UTC

using solr to do a 'match'

Hello, I have a Solr index running that is working very well for search.
But I want to add the ability (if possible) to use it to do matching. The
problem is that by default it only checks that all the input terms are
present, and it doesn't give me any indication of how many terms in the
target field were not specified by the input.

For example, if I'm trying to match the song title "dust in the wind",
I correctly get a match if the input query is "dust in wind". But I
don't want to get a match if the input is just "dust". Although as a
search "dust" should return this result, I'm looking for some way to filter
it out based on some indication that the input isn't close enough to the
output. Perhaps I could get information that the number of input
terms is much less than the number of terms in the field, or something
else along those lines?

I realize that this isn't the typical use case for a search, but I'm just
looking for some suggestions as to how I could improve the above example a
bit.

Thanks,
Chris

Fwd: using solr to do a 'match'

Posted by Li Li <fa...@gmail.com>.

---------- Forwarded message ----------
From: Li Li <fa...@gmail.com>
Date: Wed, Apr 11, 2012 at 4:59 PM
Subject: Re: using solr to do a 'match'
To: solr-user@lucene.apache.org


I searched my mail but found nothing.
The thread that comes up for the keywords "boolean expression" is "Indexing
Boolean Expressions" from joaquin.delgado.
To tell which terms are matched, for BooleanScorer2, a simple method is to
modify DisjunctionSumScorer and add a BitSet to record the matched scorers.
When the collector collects a document, it can get the scorer and recursively
find the matched terms.
But I think it may be better to add a component, perhaps named matcher, that
does the matching job, so that the scorer uses the information from the
matcher and does the ranking.
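
As a rough illustration of the BitSet idea above, here is a minimal sketch in
plain Python. It does not use Lucene: the postings dictionary, the function
names and the integer bitmask standing in for a BitSet are all assumptions
made for this example. Bit i of a document's mask is set when query term i
matched that document.

```python
from typing import Dict, List, Set


def disjunction_with_matches(postings: Dict[str, Set[int]],
                             terms: List[str]) -> Dict[int, int]:
    """Run an OR over the terms and return {doc_id: bitmask}.

    Bit i of the mask is set if terms[i] occurs in the document.
    The int mask is a stand-in for a BitSet (an assumption of this sketch).
    """
    matched: Dict[int, int] = {}
    for i, term in enumerate(terms):
        for doc_id in postings.get(term, set()):
            matched[doc_id] = matched.get(doc_id, 0) | (1 << i)
    return matched


def matched_terms(mask: int, terms: List[str]) -> List[str]:
    """Decode a bitmask back into the list of matched query terms."""
    return [t for i, t in enumerate(terms) if mask & (1 << i)]


# Hypothetical toy postings: term -> ids of documents containing it.
postings = {
    "dust": {1, 2},
    "in": {1, 3},
    "the": {1, 3},
    "wind": {1},
}
query = ["dust", "in", "wind"]
masks = disjunction_with_matches(postings, query)
```

A collector could then call matched_terms(masks[doc_id], query) for each hit
to see which clauses actually matched.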


On Wed, Apr 11, 2012 at 4:32 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Hi,
>
> This use case is similar to matching boolean expression problem. You can
> find recent thread about it. I have an idea that we can introduce
> disjunction query with dynamic mm (minShouldMatch parameter
>
> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/BooleanQuery.html#setMinimumNumberShouldMatch(int)
> )
> i.e. 'match these clauses disjunctively but for every document use
> value
> from field cache of field xxxCount as a minShouldMatch parameter'. Also
> norms can be used as a source for dynamics mm values.
>
> Wdyt?
>
> On Wed, Apr 11, 2012 at 10:08 AM, Li Li <fa...@gmail.com> wrote:
>
> > it's not possible now because lucene don't support this.
> > when doing disjunction query, it only record how many terms match this
> > document.
> > I think this is a common requirement for many users.
> > I suggest lucene should divide scorer to a matcher and a scorer.
> > the matcher just return which doc is matched and why/how the doc is
> > matched.
> > especially for disjuction query, it should tell which term matches and
> > possible other
> > information such as tf/idf and the distance of terms(to support proximity
> > search).
> > That's the matcher's job. and then the scorer(a ranking algorithm) use
> > flexible algorithm
> > to score this document and the collector can collect it.
> >
> > On Wed, Apr 11, 2012 at 10:28 AM, Chris Book <ch...@gmail.com> wrote:
> >
> > > Hello, I have a solr index running that is working very well as a search.
> > >  But I want to add the ability (if possible) to use it to do matching.  The
> > > problem is that by default it is only looking for all the input terms to be
> > > present, and it doesn't give me any indication as to how many terms in the
> > > target field were not specified by the input.
> > >
> > > For example, if I'm trying to match to the song title "dust in the wind",
> > > I'm correctly getting a match if the input query is "dust in wind".  But I
> > > don't want to get a match if the input is just "dust".  Although as a
> > > search "dust" should return this result, I'm looking for some way to filter
> > > this out based on some indication that the input isn't close enough to the
> > > output.  Perhaps if I could get information that that the number of input
> > > terms is much less than the number of terms in the field.  Or something
> > > else along those line?
> > >
> > > I realize that this isn't the typical use case for a search, but I'm just
> > > looking for some suggestions as to how I could improve the above example a
> > > bit.
> > >
> > > Thanks,
> > > Chris
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> gedel@yandex.ru
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>

Re: using solr to do a 'match'

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Hi,

This use case is similar to the problem of matching boolean expressions; there
is a recent thread about it. My idea is to introduce a
disjunction query with a dynamic mm (the minShouldMatch parameter,
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/BooleanQuery.html#setMinimumNumberShouldMatch(int)),
i.e. 'match these clauses disjunctively, but for every document use the
value from the field cache of field xxxCount as the minShouldMatch
parameter'. Norms could also be used as a source of dynamic mm values.

Wdyt?
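
To make the dynamic mm idea concrete, here is a minimal sketch in plain
Python. It is not Lucene or Solr code and no such query exists in the
versions discussed here. The docs dictionary, the function name and the
ratio parameter are assumptions for illustration; the per-document term count
computed inside the loop stands in for the value that would come from the
xxxCount field. With ratio=1.0 every term of the field must be matched, and a
lower ratio tolerates a few missing terms.

```python
import math
from typing import Dict, List


def dynamic_mm_match(query_terms: List[str],
                     docs: Dict[int, str],
                     ratio: float = 1.0) -> List[int]:
    """Return ids of docs whose matched-term count reaches a per-doc minimum.

    The minimum is derived from each document's own term count, which plays
    the role of the xxxCount field (hypothetical in this sketch).
    """
    wanted = {t.lower() for t in query_terms}
    hits = []
    for doc_id, title in docs.items():
        field_terms = set(title.lower().split())
        term_count = len(field_terms)  # stand-in for the stored xxxCount value
        required = max(1, math.ceil(ratio * term_count))
        if len(field_terms & wanted) >= required:
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical toy index: doc id -> song title.
docs = {1: "dust in the wind", 2: "dust"}

close_enough = dynamic_mm_match(["dust", "in", "wind"], docs, ratio=0.75)
too_short = dynamic_mm_match(["dust"], docs, ratio=0.75)
strict = dynamic_mm_match(["dust", "in", "wind"], docs, ratio=1.0)
```

In this toy data, "dust in wind" still reaches "dust in the wind" at
ratio=0.75 (3 of 4 terms), while the single term "dust" does not.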

On Wed, Apr 11, 2012 at 10:08 AM, Li Li <fa...@gmail.com> wrote:

> it's not possible now because lucene don't support this.
> when doing disjunction query, it only record how many terms match this
> document.
> I think this is a common requirement for many users.
> I suggest lucene should divide scorer to a matcher and a scorer.
> the matcher just return which doc is matched and why/how the doc is
> matched.
> especially for disjuction query, it should tell which term matches and
> possible other
> information such as tf/idf and the distance of terms(to support proximity
> search).
> That's the matcher's job. and then the scorer(a ranking algorithm) use
> flexible algorithm
> to score this document and the collector can collect it.
>
> On Wed, Apr 11, 2012 at 10:28 AM, Chris Book <ch...@gmail.com> wrote:
>
> > Hello, I have a solr index running that is working very well as a search.
> >  But I want to add the ability (if possible) to use it to do matching.  The
> > problem is that by default it is only looking for all the input terms to be
> > present, and it doesn't give me any indication as to how many terms in the
> > target field were not specified by the input.
> >
> > For example, if I'm trying to match to the song title "dust in the wind",
> > I'm correctly getting a match if the input query is "dust in wind".  But I
> > don't want to get a match if the input is just "dust".  Although as a
> > search "dust" should return this result, I'm looking for some way to filter
> > this out based on some indication that the input isn't close enough to the
> > output.  Perhaps if I could get information that that the number of input
> > terms is much less than the number of terms in the field.  Or something
> > else along those line?
> >
> > I realize that this isn't the typical use case for a search, but I'm just
> > looking for some suggestions as to how I could improve the above example a
> > bit.
> >
> > Thanks,
> > Chris
> >
>



-- 
Sincerely yours
Mikhail Khludnev
gedel@yandex.ru

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: using solr to do a 'match'

Posted by Li Li <fa...@gmail.com>.

It's not possible now because Lucene doesn't support this.
When running a disjunction query, it only records how many terms match the
document.
I think this is a common requirement for many users.
I suggest Lucene should split the scorer into a matcher and a scorer.
The matcher just returns which docs matched and why/how each doc matched.
Especially for a disjunction query, it should tell which terms match and
possibly other information such as tf/idf and the distance between terms
(to support proximity search).
That's the matcher's job. The scorer (a ranking algorithm) then uses a
flexible algorithm to score the document, and the collector can collect it.
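
The proposed split can be sketched roughly as follows, in plain Python rather
than Lucene. MatchInfo, match, coverage_score and rank are hypothetical names
invented for this example, and the coverage-based scorer is only one possible
ranking function, not something Lucene provides. The matcher reports which
query terms hit each document and how often; the scorer is a separate,
swappable function that consumes that information.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class MatchInfo:
    doc_id: int
    term_freqs: Dict[str, int]  # matched query term -> frequency in the doc
    doc_length: int             # number of tokens in the doc


def match(query_terms: List[str], docs: Dict[int, str]) -> List[MatchInfo]:
    """Matcher: report which docs matched and which terms matched them."""
    results = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        counts = Counter(tokens)
        tf = {t: counts[t] for t in query_terms if counts[t] > 0}
        if tf:
            results.append(MatchInfo(doc_id, tf, len(tokens)))
    return results


def coverage_score(info: MatchInfo) -> float:
    """Scorer: fraction of the doc's tokens covered by matched query terms."""
    return sum(info.term_freqs.values()) / info.doc_length


def rank(matches: List[MatchInfo],
         scorer: Callable[[MatchInfo], float]) -> List[Tuple[float, int]]:
    """Apply any scorer to the matcher's output; best score first."""
    return sorted(((scorer(m), m.doc_id) for m in matches), reverse=True)


# Hypothetical toy index: doc id -> title.
docs = {1: "dust in the wind", 2: "dust bowl"}
matches = match(["dust", "in", "wind"], docs)
ranking = rank(matches, coverage_score)
```

Because the scorer only sees MatchInfo, a filter such as "drop anything below
0.7 coverage" could be added without touching the matching code.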

On Wed, Apr 11, 2012 at 10:28 AM, Chris Book <ch...@gmail.com> wrote:

> Hello, I have a solr index running that is working very well as a search.
>  But I want to add the ability (if possible) to use it to do matching.  The
> problem is that by default it is only looking for all the input terms to be
> present, and it doesn't give me any indication as to how many terms in the
> target field were not specified by the input.
>
> For example, if I'm trying to match to the song title "dust in the wind",
> I'm correctly getting a match if the input query is "dust in wind".  But I
> don't want to get a match if the input is just "dust".  Although as a
> search "dust" should return this result, I'm looking for some way to filter
> this out based on some indication that the input isn't close enough to the
> output.  Perhaps if I could get information that that the number of input
> terms is much less than the number of terms in the field.  Or something
> else along those line?
>
> I realize that this isn't the typical use case for a search, but I'm just
> looking for some suggestions as to how I could improve the above example a
> bit.
>
> Thanks,
> Chris
>

Re: using solr to do a 'match'

Posted by jmlucjav <jm...@gmail.com>.

I have done that by getting the top X hits, finding the best match among them
(a combination of Levenshtein distance, contains checks, and so on; I tweaked
the code until testing showed good results), and then deciding whether the
candidate was a match or not, again based on custom code plus a user-defined
leniency value.

xab
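
A minimal sketch of that kind of post-filter, assuming the top hits have
already been fetched from Solr as a list of strings. This is not the
poster's code: it uses only Levenshtein distance (no "contains" checks), and
best_match and the leniency default of 0.3 are hypothetical choices for this
example. The distance is normalized by the longer string, so leniency is the
maximum fraction of characters allowed to differ.

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def best_match(query: str, candidates: List[str],
               leniency: float = 0.3) -> Optional[str]:
    """Pick the closest candidate; accept it only if it is close enough."""
    best, best_ratio = None, None
    for cand in candidates:
        longest = max(len(query), len(cand)) or 1
        ratio = levenshtein(query.lower(), cand.lower()) / longest
        if best_ratio is None or ratio < best_ratio:
            best, best_ratio = cand, ratio
    if best is not None and best_ratio <= leniency:
        return best
    return None


# Hypothetical top hits returned by the search.
top_hits = ["dust in the wind", "dust bowl"]
accepted = best_match("dust in wind", top_hits)
rejected = best_match("dust", top_hits)
```

With these toy hits, "dust in wind" is accepted as "dust in the wind" (4
edits over 16 characters), while "dust" alone is too far from either title
and is rejected.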
