You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Philipp Nanz <ph...@gmail.com> on 2007/03/06 01:08:05 UTC

alternative scoring algorithm for PhraseQuery

Hello folks,

Maybe one of you can help me with this (sorry, long read).

I have implemented a FuzzyPhraseQuery that works similar to Lucene's
native PhraseQuery.
I.e. it can retrieve phrases for a query, with respect to insertions
and term order.
But in addition it can also find matches with terms missing (deletions).

Scoring is implemented as described here:
http://www.gossamer-threads.com/lists/lucene/java-user/33558#33558

So the scorer uses the total error rather than the maximum error for
insertions and out-of-order. That part works all fine (eventhough the
total errors I'm observing quickly lead to very low frequencies
returned by sloppyFreq() )

Now my problem is with scoring the deletion cases.

My initial idea was to penalize a missing term position with its maximum error.

Consider this:

Query:  a b c d

Document A: b c d

Term a is missing, score it as if it was at the worst position possible

result:       b c d a
pos. diffs: -1 -1 -1 +3

It can be observed that the max error for the nth missing term is 2n - 2
If you have a query given with 100 terms and say 10 of them are not
found, I would have a penalty of 190 + 192 + 194 etc.

for extreme cases, this is rather simple to calculate. in the middle
of a phrase, things get tricky though. Also the penalty becomes higher
as the number of terms increases.

So I think this is no viable solution for my problem.

Does anyone know a better solution for scoring deletion cases?

Thanks for your input,
Philipp

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: alternative scoring algorithm for PhraseQuery

Posted by Paul Elschot <pa...@xs4all.nl>.

Philipp,

First off: I have no solutions, just some existing things that might
be useful.

On Tuesday 06 March 2007 01:08, Philipp Nanz wrote:
> Hello folks,
> 
...
> 
> Now my problem is with scoring the deletion cases.
> 
> My initial idea was to penalize a missing term position with its maximum 
error.
> 
...
> 
> Does anyone know a better solution for scoring deletion cases?

The closest existing thing is coord() factor from Similarity.
But using that will only allow you to delay the implementation.

You can also use the sloppyFreq() from Similarity, but then
you still need a way to determine the sloppiness for any possible
term order. Span queries to that by just taking the distance
between the first and last matched "term".

Would your implementation also generalize to a SpanQuery?

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: alternative scoring algorithm for PhraseQuery

Posted by Chris Hostetter <ho...@fucit.org>.

: Query: a b c d
: Doc A: a b c d => sloppyFreq(0) * coord(4, 4) = 1
: Doc B: a b c => sloppyFreq(0) * coord(3, 4) = 0,75
:
: Doc would score higher. I guess that might be a valid solution.
: There is a drawback though, i.e. sloppyFreq(1) * coord(4, 4) = 0,5
: So a perfect match with one insertion would score less than a 3 of 4
: match with no slop.

but now you've put the control in the hands of the client: they can choose
a Similarity based on what is more important too them: if matching more
clauses is important, they can have a strict coord function, if matching
with less slop is more important they can have a strict sloppyFreq method.

: don't know the inner workings of SpanQueries, but what you describe
: sounds alot like what the PhraseQuery does as well (i.e. calculate max
: distance between last and first term, and use that with sloppyFreq()).

correct, the big advantage of Span queries is that while a SpanNearQuery
is roughly equivilent to a PhraseQuery, a PhraseQuery can only contain
Terms, whilea SPanNearQuery can contain other spans ... so a spannear
query for: "a b c d" can function even if "a" is a complicated sub query
(like "x OR y OR (p near q but not with z between them)")



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: alternative scoring algorithm for PhraseQuery

Posted by Ken Krugler <kk...@transpac.com>.

Hi Philipp,

At 10:49 pm +0100 3/7/07, Paul Elschot wrote:
>On Wednesday 07 March 2007 18:12, Philipp Nanz wrote:
>>  Thanks for your answers. Your input is really appreciated :-)
>>
>>  @Paul Elschot:
>>  Thanks for the hint. I guess I could use coord() to penalize missing
>>  terms like this:
>>
>>  Query: a b c d
>>  Doc A: a b c d => sloppyFreq(0) * coord(4, 4) = 1
>>  Doc B: a b c => sloppyFreq(0) * coord(3, 4) = 0,75
>>
>>  Doc would score higher. I guess that might be a valid solution.
>>
>>  There is a drawback though, i.e. sloppyFreq(1) * coord(4, 4) = 0,5
>>
>>  So a perfect match with one insertion would score less than a 3 of 4
>>  match with no slop.
>
>Your examples are based on DefaultSimilarity.
>With a  Similarity in your Scorer you can leave the tradeoff between these
>factors to the user of your query by letting them provide the Similarity
>at query time.

[snip]

I'm curious if Paul's input here helped you finish your 
FuzzyPhraseQuery (or FuzzySpanQuery) addition to Lucene.

Thanks,

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: alternative scoring algorithm for PhraseQuery

Posted by Paul Elschot <pa...@xs4all.nl>.

On Wednesday 07 March 2007 18:12, Philipp Nanz wrote:
> Thanks for your answers. Your input is really appreciated :-)
> 
> @Paul Elschot:
> Thanks for the hint. I guess I could use coord() to penalize missing
> terms like this:
> 
> Query: a b c d
> Doc A: a b c d => sloppyFreq(0) * coord(4, 4) = 1
> Doc B: a b c => sloppyFreq(0) * coord(3, 4) = 0,75
> 
> Doc would score higher. I guess that might be a valid solution.
> 
> There is a drawback though, i.e. sloppyFreq(1) * coord(4, 4) = 0,5
> 
> So a perfect match with one insertion would score less than a 3 of 4
> match with no slop.

Your examples are based on DefaultSimilarity. 
With a  Similarity in your Scorer you can leave the tradeoff between these
factors to the user of your query by letting them provide the Similarity
at query time.

> 
> As for spanqueries:
> My implementation is based of the default PhraseQuery with slop > 0. I
> don't know the inner workings of SpanQueries, but what you describe
> sounds alot like what the PhraseQuery does as well (i.e. calculate max
> distance between last and first term, and use that with sloppyFreq()).
> 
> I chose PhraseQuery as base of my work, because I felt that it would
> offer better performance than firing off a plethora of spanqueries to
> express the same query.
> 
> Long story short: My problem would generalize to spanqueries if
> spanqueries would face the problem of deleted terms. But I guess they
> don't?!

Correct, they don't.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: alternative scoring algorithm for PhraseQuery

Posted by Philipp Nanz <ph...@gmail.com>.

Thanks for your answers. Your input is really appreciated :-)

@Paul Elschot:
Thanks for the hint. I guess I could use coord() to penalize missing
terms like this:

Query: a b c d
Doc A: a b c d => sloppyFreq(0) * coord(4, 4) = 1
Doc B: a b c => sloppyFreq(0) * coord(3, 4) = 0,75

Doc would score higher. I guess that might be a valid solution.

There is a drawback though, i.e. sloppyFreq(1) * coord(4, 4) = 0,5

So a perfect match with one insertion would score less than a 3 of 4
match with no slop.

As for spanqueries:
My implementation is based of the default PhraseQuery with slop > 0. I
don't know the inner workings of SpanQueries, but what you describe
sounds alot like what the PhraseQuery does as well (i.e. calculate max
distance between last and first term, and use that with sloppyFreq()).

I chose PhraseQuery as base of my work, because I felt that it would
offer better performance than firing off a plethora of spanqueries to
express the same query.

Long story short: My problem would generalize to spanqueries if
spanqueries would face the problem of deleted terms. But I guess they
don't?!

@Chris Hostetter: You are absolutely right. But it shows off into
which direction it could go to. Perhaps I could add +1 (or some other
amount) as additional penalty to the maximum error for missing terms
to distinguish between these cases further.

But still this could lead to a case where

Doc A: a b c x1 x2 [more x...] xn d
will be scored lower than
Doc B: a b c
(because the distance of A can exceed the penalty for the missing term
- its only a matter of choosing the right n)
which is questionable as well.

2007/3/6, Chris Hostetter <ho...@fucit.org>:
> : My initial idea was to penalize a missing term position with its maximum error.
> :
> : Consider this:
> : Query:  a b c d
> : Document A: b c d
> :
> : Term a is missing, score it as if it was at the worst position possible
> :
> : result:       b c d a
> : pos. diffs: -1 -1 -1 +3
>
> side comment: this doesn't sound very useful, a document containing "b c
> d" matches equally to a doc containing "b c d a" ? ... shouldn't a doc
> containing "b c d a" be considered a much better match since it at least
> contains all of the terms close together?
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: alternative scoring algorithm for PhraseQuery

Posted by Chris Hostetter <ho...@fucit.org>.

: My initial idea was to penalize a missing term position with its maximum error.
:
: Consider this:
: Query:  a b c d
: Document A: b c d
:
: Term a is missing, score it as if it was at the worst position possible
:
: result:       b c d a
: pos. diffs: -1 -1 -1 +3

side comment: this doesn't sound very useful, a document containing "b c
d" matches equally to a doc containing "b c d a" ? ... shouldn't a doc
containing "b c d a" be considered a much better match since it at least
contains all of the terms close together?



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org