Posted to solr-user@lucene.apache.org by Demian Katz <de...@villanova.edu> on 2011/10/25 19:13:18 UTC

DisMax and WordDelimiterFilterFactory

I've seen a couple of threads related to this subject (for example, http://www.mail-archive.com/solr-user@lucene.apache.org/msg33400.html), but I haven't found an answer that addresses the aspect of the problem that concerns me...

I have a field type set up like this:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

The important feature here is the use of WordDelimiterFilterFactory, which allows a search for "WiFi" to match an indexed term of "wi fi" (for example).

The problem, of course, is that if a user accidentally introduces a case change in their query, the query analyzer chain breaks it into multiple words and no hits are found...  so a search for "exaMple" will look for "exa mple" and fail.

I've found two solutions that resolve this problem in the admin panel field analysis tool:


1.)    Turn on catenateWords and catenateNumbers in the query analyzer - this reassembles the user's broken word and allows a match.

2.)    Turn on preserveOriginal in the query analyzer - this passes through the user's original query, which then gets cleaned up by the ICUFoldingFilterFactory and allows a match.
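
For reference, here is a rough sketch of the two query-analyzer variants. Only the WordDelimiterFilterFactory line changes; I'm assuming the rest of the query chain stays exactly as in the field type above:

    <!-- Variant 1: catenate the split parts back together at query time -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

    <!-- Variant 2: keep the user's original token alongside the split parts -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>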

The problem is that in my real-world application, which uses DisMax, neither of these solutions works.  It appears that even though (if I understand correctly) the WordDelimiterFilterFactory is returning ALTERNATIVE tokens, the DisMax handler is combining them in a way that inappropriately requires all of them to match...  for example, here's partial debugQuery output for the "exaMple" search using DisMax and solution #2 above:

    "parsedquery":"+DisjunctionMaxQuery((genre:\"(exampl exa) mple\"^300.0 | title_new:\"(exampl exa) mple\"^100.0 | topic:\"(exampl exa) mple\"^500.0 | series:\"(exampl exa) mple\"^50.0 | title_full_unstemmed:\"(example exa) mple\"^600.0 | geographic:\"(exampl exa) mple\"^300.0 | contents:\"(exampl exa) mple\"^10.0 | fulltext_unstemmed:\"(example exa) mple\"^10.0 | allfields_unstemmed:\"(example exa) mple\"^10.0 | title_alt:\"(exampl exa) mple\"^200.0 | series2:\"(exampl exa) mple\"^30.0 | title_short:\"(exampl exa) mple\"^750.0 | author:\"(example exa) mple\"^300.0 | title:\"(exampl exa) mple\"^500.0 | topic_unstemmed:\"(example exa) mple\"^550.0 | allfields:\"(exampl exa) mple\" | author_fuller:\"(example exa) mple\"^150.0 | title_full:\"(exampl exa) mple\"^400.0 | fulltext:\"(exampl exa) mple\")) ()",

Obviously, that is not what I want - ideally it would be something like 'exampl OR "exa mple"'.

I also read about the autoGeneratePhraseQueries setting, but that seems to take things way too far in the opposite direction - if I set that to false, then I get matches for any individual token; i.e. example OR exa OR mple - not good at all!
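
As a sketch, that setting is an attribute on the field type itself (this assumes a Solr version that supports the attribute; the analyzers are unchanged from the definition above):

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <!-- index and query analyzers exactly as in the field type above -->
    </fieldType>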

I have a sinking suspicion that there is not an easy solution to my problem, but this seems to be a fairly basic need; splitOnCaseChange is a useful feature to have, but it's more valuable if it serves as an ALTERNATIVE search rather than a necessary query munge.  Any thoughts?

thanks,
Demian

RE: DisMax and WordDelimiterFilterFactory (limitations of MultiPhraseQuery)

Posted by Demian Katz <de...@villanova.edu>.
If we change the query chain to not split on case change, then we lose half the benefit of that feature -- if a user types "WiFi" and the source record contains "wi fi," we fail to get a hit.  As you say, that may be worth considering if it comes down to picking the lesser evil, but I still think there should be a complete solution to my problem -- I'm not trying to compensate for every fat-fingered user behavior... just one specific one!
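
For concreteness, the change being discussed would look something like this. It is a sketch only; the one difference from my original query analyzer is splitOnCaseChange, and the index analyzer is assumed to keep splitOnCaseChange="1":

    <!-- query analyzer only: stop splitting on case change -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>

With this in place, a query for "WiFi" stays a single token ("wifi" after folding), which is why it no longer lines up with a record indexed as the two tokens "wi" and "fi".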

Ultimately, I think my problem relates to this note from the documentation about using phrases in the SynonymFilterFactory:

"Phrase searching (ie: "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would not match the simple case of "seabiscuit" occuring in a document."

So I suppose I'm just running up against a fundamental limitation of Solr...  but this seems like a fundamental limitation that might be worth overcoming -- I'm sure my use case is not the only one where this could matter.  Has anyone given this any thought?

- Demian

> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Thursday, October 27, 2011 8:21 AM
> To: solr-user@lucene.apache.org
> Subject: Re: DisMax and WordDelimiterFilterFactory
> 
> What happens if you change your WDDF definition in the query part of
> your analysis chain to NOT split on case change? Then your index should
> contain the right fragments (and combined words) and your queries would
> match.
> 
> I admit I haven't thought this through entirely, but this would work
> for your example I think. Unfortunately I suspect it would break other
> cases.... I suspect you're in a "lesser of two evils" situation.
> 
> But I can't imagine a 100% solution here. You're effectively asking to
> compensate for any fat-fingered thing a user does. Impossible I think...
> 
> Best
> Erick
> 

Re: DisMax and WordDelimiterFilterFactory

Posted by Erick Erickson <er...@gmail.com>.
What happens if you change your WDDF definition in the query part of
your analysis chain to NOT split on case change? Then your index should
contain the right fragments (and combined words) and your queries would
match.

I admit I haven't thought this through entirely, but this would work
for your example I think. Unfortunately I suspect it would break other
cases.... I suspect you're in a "lesser of two evils" situation.

But I can't imagine a 100% solution here. You're effectively asking to
compensate for any fat-fingered thing a user does. Impossible I think...

Best
Erick
