Posted to solr-user@lucene.apache.org by Steven Fuchs <st...@aps.org> on 2012/01/09 17:38:40 UTC
Re: issues with WordDelimiterFilter

Thanks for the reply.
On Dec 30, 2011, at 6:04 PM, Chris Hostetter wrote:

> 
> : I'm having an issue with the way the WordDelimiterFilter parses compound 
> : words. My field declaration is simple, looks like this:
> : 
> :       <analyzer type="index">
> :         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> :         <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
> :         <filter class="solr.LowerCaseFilterFactory"/>
> :       </analyzer>
> 
> you haven't said anything about what your query time analyzer looks like 
> -- based on your other comments, i'm going to assume it just uses 
> whitespaceTokenizer and lower case filter w/o WDF at all -- but if you 
> don't have any "query" analyzer declared that means the analyzer above is 
> used in both cases, which is most likely not what you want.

Yes, you are correct: my query analyzer is just whitespace and lower case. I had only omitted it for clarity.
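
For reference, a query analyzer matching that description (whitespace tokenizer plus lower case filter, no WordDelimiterFilter) would look roughly like the fragment below. This is a sketch based only on the description in this thread, assuming it sits in the same fieldType as the index analyzer quoted above; the actual declaration was not posted.

```xml
<!-- Sketch only: query-side analyzer as described in this thread. -->
<!-- No WordDelimiterFilter here, so "fokker-planck" stays a single query token. -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```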

> : So in the case where fokker-planck is the first token there should be no 
> : second token, it's already been used if the first was matched. The 
> 
> that type of logic (hierarchical sequences of tokens) is just not possible 
> with lucene.

OK, so if I understand correctly, this is an issue but it can't be worked around...


> : problem manifests itself when doing phrase searches...
> : 
> : "Fokker-Planck equations" won't find the exact phrase, Fokker-Planck 
> : equations, because it sees the term planck as between Fokker-Planck and 
> : equations. Hope that makes sense! Should I submit this as a bug?
> 
> for phrase queries like this to work when using WDF, it's necessary to 
> use some slop in your phrase query (to overcome the position gaps 
> introduced by the split out tokens) ... either that, or turn off 
> "preserveOriginal" and use a query analyzer that also splits at query time

It seems like preserveOriginal isn't the best option here, since it introduces an extra term into the indexed version of the text. I don't really want the query to be split, as fokker-planck shouldn't find a lone planck... I may need a separate field that isn't word-delimited and use an OR of the two as my result...
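
A rough sketch of that separate-field idea follows. The names text_exact, body, and body_exact are hypothetical placeholders, not taken from the actual schema, and the existing WordDelimiter field is assumed to be called body. The second field skips WordDelimiterFilter entirely, so "fokker-planck" is indexed as a single token with no extra positions.

```xml
<!-- Sketch only: "text_exact", "body" and "body_exact" are hypothetical names. -->
<!-- Field type without WordDelimiterFilter: tokens are split on whitespace only. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Second copy of the text, indexed with the exact-token analysis. -->
<field name="body_exact" type="text_exact" indexed="true" stored="false"/>

<!-- Feed the same source text into both fields at index time. -->
<copyField source="body" dest="body_exact"/>
```

Under these assumptions, a quoted phrase would be sent to body_exact and unquoted terms to body, with the two clauses combined by OR. Whether that yields exactly the matching I want would still need to be tested.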

> 
> : As it stands it would return a true hit (erroneously I believe) on the 
> : phrase search "fokker planck", so really all 3 tokens should be returned 
> 
> Hmmm... if you do *not* want a phrase search for "fokker planck" to match 
> documents containing "fokker-planck" then why are you using WDF at all?

In the case where quotes are used, I do want an exact phrase search done. So I do want fokker, planck, fokker planck, or "fokker-planck" to match a document that contains the term fokker-planck, but not the quoted phrase "fokker planck".


> 
> : at offset 0 and there should be no second token so phrase searches are 
> : preserved.
> 
> if all the tokens wound up in the exact same position, then a 
> phrase query for "fokker planck" would still match this document (so it 
> wouldn't solve your problem) but you would also get matches for things 
> like the phrase "planck fokker" -- which is not likely what *anyone* 
> would expect.
> 
> 
> -Hoss


Thanks so much for your time. This was very helpful!

steve