Posted to solr-user@lucene.apache.org by Steven Fuchs <st...@aps.org> on 2012/01/09 17:38:40 UTC
Re: issues with WordDelimiterFilter
Thanks for the reply.
On Dec 30, 2011, at 6:04 PM, Chris Hostetter wrote:
>
> : I'm having an issue with the way the WordDelimiterFilter parses compound
> : words. My field declaration is simple, looks like this:
> :
> : <analyzer type="index">
> : <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> : <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
> : <filter class="solr.LowerCaseFilterFactory"/>
> : </analyzer>
>
> you haven't said anything about what your query time analyzer looks like
> -- based on your other comments, i'm going to assume it just uses
> whitespaceTokenizer and lower case filter w/o WDF at all -- but if you
> don't have any "query" analyzer declared that means the analyzer above is
> used in both cases, which is most likely not what you want.
Yes, you are correct that my query analyzer is just whitespace and lower case. I had only omitted it for clarity.
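For reference, a query analyzer matching that description would look roughly like the following. This is a sketch reconstructed from the description above, not copied from the actual schema:

```xml
<!-- Assumed reconstruction: whitespace tokenizer plus lower-casing, no WordDelimiterFilter -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```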
> : So in the case where fokker-planck is the first token there should be no
> : second token, it's already been used if the first was matched. The
>
> that type of logic (hierarchical sequences of tokens) is just not possible
> with lucene.
OK, so if I understand correctly, this is an issue but it can't be worked around...
> : problem manifests itself when doing phrase searches...
> :
> : "Fokker-Planck equations" won't find the exact phrase, Fokker-Planck
> : equations, because it sees the term planck as between Fokker-Planck and
> : equations. Hope that makes sense! Should I submit this as a bug?
>
> for phrase queries like this to work when using WDF, it's necessary to
> use some slop in your phrase query (to overcome the position gaps
> introduced by the split out tokens) ... either that, or turn off
> "preserveOriginal" and use a query analyzer that also splits at query time
It seems like preserveOriginal isn't the best option here, since it introduces an extra term into the indexed version of the text. I don't really want the query to be split, as fokker-planck shouldn't find a lone planck. I may need a separate field that isn't word-delimited and use an OR of the two as my result.
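A minimal sketch of that two-field idea, assuming hypothetical type and field names (text_wdf, text_exact, body, body_exact) that do not come from the actual schema. The word-delimited field would serve loose term matches, while the second field keeps fokker-planck as a single token:

```xml
<!-- Hypothetical names throughout; only the factory classes appear in this thread. -->
<fieldType name="text_wdf" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal left out on the assumption that body_exact covers the unsplit form -->
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- No WordDelimiterFilter: "Fokker-Planck" is indexed as the single token fokker-planck -->
<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="body" type="text_wdf" indexed="true" stored="true"/>
<field name="body_exact" type="text_exact" indexed="true" stored="false"/>

<!-- Feed the same source text into the exact field at index time -->
<copyField source="body" dest="body_exact"/>
```

Under this sketch an unquoted query could be sent to body OR body_exact, but a quoted phrase would presumably have to go to body_exact alone; otherwise "fokker planck" would still match through the word-delimited field. How that routing is done is left to the application.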
>
> : As it stands it would return a true hit (erroneously I believe) on the
> : phrase search "fokker planck", so really all 3 tokens should be returned
>
> Hmmm... if you do *not* want a phrase search for "fokker planck" to match
> documents containing "fokker-planck" then why are you using WDF at all?
In the case where quotes are used, I do want an exact phrase search done. So I do want fokker, planck, fokker planck, or "fokker-planck" to match a document that contains the term fokker-planck, but not "fokker planck".
>
> : at offset 0 and there should be no second token so phrase searches are
> : preserved.
>
> if all the tokens wound up in the exact same position, then a
> phrase query for "fokker planck" would still match this document (so it
> wouldn't solve your problem) but you would also get matches for things
> like the phrase "planck fokker" -- which is not likely what *anyone*
> would expect.
>
>
> -Hoss
Thanks so much for your time. Very helpful!
steve