Posted to solr-user@lucene.apache.org by Bill Dueber <bi...@dueber.com> on 2009/12/04 17:26:29 UTC

edismax using bigrams instead of phrases?

I've started trying edismax, and have noticed that my relevancy ranking is
messed up with edismax because, according to the debug output, it's using
bigrams instead of phrases and inexplicably ignoring a couple of the pf
fields. While the hit count isn't changing,  this kills my ability to boost
exact title matches (or, I would guess, exact-anything-else matches, too).

debugQuery output can be seen at:

http://paste.lisp.org/display/91582

That's the exact same query except for the defType.

Note that instead of looking in the 'pf' fields for the search string "gone
with the wind", it's looking individually for "gone with", "with the", and
"the wind".

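To make the difference concrete, here is a small Python sketch (not Solr
code; the function name bigram_phrases is made up for illustration) that
splits a query into the adjacent two-word phrases seen in the debug output,
next to the single whole-query phrase that dismax puts in the pf fields.

```python
def bigram_phrases(query):
    """Return the adjacent two-word phrases of a whitespace-split query."""
    words = query.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


query = "gone with the wind"

# dismax: the whole query is boosted as one phrase in each pf field.
full_phrase = query

# edismax (as observed in the debug output): one phrase per word pair.
bigrams = bigram_phrases(query)

print(full_phrase)  # gone with the wind
print(bigrams)      # ['gone with', 'with the', 'the wind']
```

A document titled exactly "Gone with the Wind" matches all three bigrams,
but so does any document that merely contains the three pairs somewhere,
which is why the exact-title boost gets diluted.
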
edismax is also completely ignoring the title_a and title_ab fields, which
are defined as "exactmatcher" as follows.

<!-- Full string, stripped of \W and lowercased, for exact and left-anchored
matching -->
     <fieldType name="exactmatcher" class="solr.TextField" omitNorms="true">
       <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="schema.UnicodeNormalizationFilterFactory"
version="icu4j" composed="false" remove_diacritics="true"
remove_modifiers="true" fold="true"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.TrimFilterFactory"/>
         <filter class="solr.PatternReplaceFilterFactory"
              pattern="[^\p{L}\p{N}]" replacement=""  replace="all"
         />
       </analyzer>
     </fieldType>
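
For anyone who doesn't want to trace the filter chain, here is a rough
Python approximation of what this field type does to a value. It is only a
sketch: schema.UnicodeNormalizationFilterFactory is a local class whose
exact behavior isn't shown here, so the NFKD-plus-strip-combining-marks step
below is an assumption, and the function name is made up.

```python
import unicodedata


def exactmatcher(value):
    """Approximate the 'exactmatcher' analysis chain; returns one token."""
    # KeywordTokenizer: the whole value is a single token, so no splitting.
    # Assumed stand-in for the custom normalization filter:
    # decompose, then drop combining marks (diacritics).
    decomposed = unicodedata.normalize("NFKD", value)
    no_marks = "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )
    # LowerCaseFilter, then TrimFilter.
    token = no_marks.lower().strip()
    # PatternReplaceFilter [^\p{L}\p{N}] -> "": keep only letters and numbers.
    return "".join(c for c in token if unicodedata.category(c)[0] in ("L", "N"))


print(exactmatcher("  Gone with the Wind! "))  # gonewiththewind
```

The point is that the field always yields exactly one token, however many
words the input has.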


Any help on this would be much appreciated.


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library

Re: edismax using bigrams instead of phrases?

Posted by Bill Dueber <bi...@dueber.com>.

On Mon, Dec 7, 2009 at 5:45 PM, Chris Hostetter <ho...@fucit.org> wrote:

>
> it would be a mistake to have a "pf1" field that was an alias for "pf" ...
> as it stands the "pf" parm in dismax is analogous to a "pf*" or
> "pf-Infinity"
>

Of course -- I was... well, let's just pretend I was drunk.

How about pfInf or pfAll?

>
>

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library

Re: edismax using bigrams instead of phrases?

Posted by Chris Hostetter <ho...@fucit.org>.

: I see that edismax already defines pf (bigrams) and pf3 (trigrams) -- how
: would folks think about just calling them pf / pf1 (aliases for each
: other?), pf2, and pf3? The pf would then behave exactly as it does in
: dismax.

changing edismax's current parsing logic to be applied to a "pf2" param
and restoring the original "pf" logic certainly makes sense -- but i think
it would be a mistake to have a "pf1" field that was an alias for "pf" ...
as it stands the "pf" param in dismax is analogous to a "pf*" or
"pf-Infinity" type option requiring all of the words however many there
are ... in the context of "pf2" and "pf3" a "pf1" option would imply that
it did a phrase boosting on each individual word -- which wouldn't be very
useful at all (that's what qf is for)
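
a quick python sketch of the naming argument (illustrative only, not
the edismax code; phrase_windows and its size argument are made up, and
the pf2/pf1 names are just the proposal being discussed in this thread):

```python
def phrase_windows(query, size=None):
    """Phrases a 'pf<size>' option would boost; size=None means all words."""
    words = query.split()
    if size is None:
        # Original dismax "pf": one phrase containing every word.
        return [" ".join(words)]
    # Sliding window of `size` adjacent words; empty if the query is shorter.
    return [
        " ".join(words[i:i + size])
        for i in range(len(words) - size + 1)
    ]


query = "gone with the wind"
print(phrase_windows(query))     # "pf":  ['gone with the wind']
print(phrase_windows(query, 2))  # "pf2": ['gone with', 'with the', 'the wind']
print(phrase_windows(query, 3))  # "pf3": ['gone with the', 'with the wind']
print(phrase_windows(query, 1))  # "pf1": ['gone', 'with', 'the', 'wind']
```

the size=1 case is just the individual terms, which is why "pf1" reads
as a per-word boost rather than as an alias for the whole-phrase "pf".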



-Hoss

Re: edismax using bigrams instead of phrases?

Posted by Bill Dueber <bi...@dueber.com>.

I see that edismax already defines pf (bigrams) and pf3 (trigrams) -- how
would folks think about just calling them pf / pf1 (aliases for each
other?), pf2, and pf3? The pf would then behave exactly as it does in
dismax.

And it sounds like the solution to my single-token fields is to just move
them into the query itself.

Thanks!

On Fri, Dec 4, 2009 at 11:58 AM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> On Fri, Dec 4, 2009 at 11:26 AM, Bill Dueber <bi...@dueber.com> wrote:
> > I've started trying edismax, and have noticed that my relevancy
> > ranking is messed up with edismax because, according to the debug
> > output, it's using bigrams instead of phrases and inexplicably
> > ignoring a couple of the pf fields. While the hit count isn't
> > changing, this kills my ability to boost exact title matches (or, I
> > would guess, exact-anything-else matches, too).
>
> It's a feature in general - the problem with putting all the terms in
> a single phrase query is that you get no boosting at all if all of the
> terms don't appear.
>
> But since it may be useful as an option, perhaps we should add the
> single-phrase option to extended dismax as well.
>
> > edismax is also completely ignoring the title_a and title_ab fields,
> > which are defined as "exactmatcher" as follows.
>
> I believe this is because extended dismax only adds phrases for
> boosting... hence if a field type outputs a single token, it's
> considered redundant with the main query.  This is an optimization to
> speed up queries (esp single-word queries).
> Perhaps one way to fix this would be to check if the pf is in the qf
> list before removing single term phrases?
>
> -Yonik
> http://www.lucidimagination.com
>



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library

Re: edismax using bigrams instead of phrases?

Posted by Chris Hostetter <ho...@fucit.org>.

: > I've started trying edismax, and have noticed that my relevancy ranking is
: > messed up with edismax because, according to the debug output, it's using
: > bigrams instead of phrases and inexplicably ignoring a couple of the pf

I noticed that as well while testing edismax on the train the other day
(notes attached to SOLR-1553 earlier today)

: It's a feature in general - the problem with putting all the terms in
: a single phrase query is that you get no boosting at all if all of the
: terms don't appear.

But sometimes that's what you want -- pf was intended to support the
use case where people remember an exact phrase from the text (ie: they
cut/paste the title, or the first line from an abstract, etc...) and want
that right at the top.  removing that and replacing it with a
shingle-based approach allows other docs that match lots of "bits" of the
input string to overshadow exact matches.


-Hoss

Re: edismax using bigrams instead of phrases?

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Fri, Dec 4, 2009 at 11:26 AM, Bill Dueber <bi...@dueber.com> wrote:
> I've started trying edismax, and have noticed that my relevancy ranking is
> messed up with edismax because, according to the debug output, it's using
> bigrams instead of phrases and inexplicably ignoring a couple of the pf
> fields. While the hit count isn't changing,  this kills my ability to boost
> exact title matches (or, I would guess, exact-anything-else matches, too).

It's a feature in general - the problem with putting all the terms in
a single phrase query is that you get no boosting at all unless all of
the terms appear.

But since it may be useful as an option, perhaps we should add the
single-phrase option to extended dismax as well.

> edismax is also completely ignoring the title_a and title_ab fields, which
> are defined as "exactmatcher" as follows.

I believe this is because extended dismax only adds phrases for
boosting... hence if a field type outputs a single token, it's
considered redundant with the main query.  This is an optimization to
speed up queries (esp single-word queries).
Perhaps one way to fix this would be to check if the pf is in the qf
list before removing single term phrases?
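
A rough sketch of that idea in Python (not the actual edismax source; the
function name, the check_qf flag, the qf list, and the "title" field are
all made up for illustration, only title_a and title_ab come from the
schema above):

```python
def keep_phrase_clause(field, token_count, qf_fields, check_qf=True):
    """Decide whether a pf field still gets a phrase-boost clause."""
    if token_count > 1:
        # A real multi-term phrase: always worth boosting.
        return True
    if check_qf and field not in qf_fields:
        # Single token, but the field is not searched by the main query,
        # so the clause is not redundant and should be kept.
        return True
    # Single token in a field the main query already covers: drop it.
    return False


qf_fields = {"title", "author"}  # hypothetical qf list
# Tokens each pf field produces for "gone with the wind".
pf_token_counts = {"title": 4, "title_a": 1, "title_ab": 1}

current = [f for f, n in pf_token_counts.items()
           if keep_phrase_clause(f, n, qf_fields, check_qf=False)]
proposed = [f for f, n in pf_token_counts.items()
            if keep_phrase_clause(f, n, qf_fields)]

print(current)   # ['title']
print(proposed)  # ['title', 'title_a', 'title_ab']
```

With the qf check, single-token fields like title_a survive as boost
clauses whenever they are not already part of the main query.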

-Yonik
http://www.lucidimagination.com