Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2014/03/16 06:00:44 UTC

[jira] [Updated] (SOLR-1799) enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter

     [ https://issues.apache.org/jira/browse/SOLR-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated SOLR-1799:
-------------------------------

    Fix Version/s:     (was: 4.7)
                   4.8

> enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter
> ----------------------------------------------------------------------
>
>                 Key: SOLR-1799
>                 URL: https://issues.apache.org/jira/browse/SOLR-1799
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>    Affects Versions: 1.3, 1.4
>            Reporter: Chris Darroch
>            Priority: Minor
>             Fix For: 4.8
>
>         Attachments: SOLR-1799.patch
>
>
> At the bottom of the WordDelimiterFilter.java code there's the following comment:
> // downsides:  if source text is "powershot" then a query of "PowerShot" won't match!
> Another serious example for us might be something like an indexed document containing the word "Tribeca" or "Soho", and then a user trying to search for "TriBeCa" or "SoHo".
> This issue has turned up in a couple of recent mailing list threads:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200908.mbox/%3cfe4f94830908201429j3ffbcdd3s3cb7d80542b31e48@mail.gmail.com%3e
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%3c72d9e9500905121619p68c27099ibc7079e52cb0e48e@mail.gmail.com%3e
> In the first thread I found the best explication of what my own misunderstanding was, and it's something I'm sure must trip up other people as well:
> {quote}
> I've misunderstood WordDelimiterFilter.  You might think that catenateAll="1" would append the full phrase (sans delimiters) as an OR against the query.  So "jOkersWild" would produce:
> "j (okers wild)" OR "jokerswild"
> But you thought wrong.  It's actually:
> "j (okers wild jokerswild)"
> Which is confusing and won't match...
> {quote}
> In the second thread, Yonik Seeley gives a good explanation of why this occurs, and provides a suggested workaround where you duplicate your data fields and then query on one using generateWordParts="1" and on the other using catenateWords="1".  That works, but obviously requires data duplication.  In our case, we are also following what I believe is recommended practice and duplicating our data already into stemmed and unstemmed indexes.  To my mind, to further duplicate both of these fields a second time, with no difference in the indexed data of the additional copy, seems needlessly wasteful when the problem lies entirely in the query side of things.
> At any rate, I'm attaching a patch against Solr 1.3 which is rather hacky, but seems to work for us.  In WordDelimiterFilter, if generateWordParts="1" and catenateWords="2", then we move the concatenated word to overlap its position with the first generated token instead of the last (which is the behaviour with catenateWords="1").  We further insert a preceding dummy flag token with the special type "CATENATE_FIRST".
> In SolrPluginUtils in the DisjunctionMaxQueryParser class we just copy in the entirety of the getFieldQuery() code from Lucene's QueryParser.  This is ugly, I know.  This code is then tweaked so that in the case where the dummy flag token is seen, it creates a BooleanQuery with the following token (the concatenated word) as a conditional TermQuery clause, and then adds the generated terms in their usual MultiPhraseQuery as a second conditional clause.
> Now I realize this patch is (a) not likely acceptable on style and elegance grounds, and (b) only against Solr 1.3, not trunk.  My apologies for both; after I'd spent most of what time I had available tracking down the source of the problem, I just needed to get something working quickly.  Perhaps this patch will inspire others to greatness, though, or at a minimum provide a starting point for those who stumble over this same issue.
> Thanks for a great application!  Cheers.
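
A rough sketch of the position problem described above, written in plain Java rather than against Lucene's real classes. It assumes that tokens are lowercased after WordDelimiterFilter, that tokens stacked at one position act as alternatives at that phrase position, and that the "conditional" clauses in the patch behave like OR clauses. The names CatenateSketch, phraseMatches, stackedOnLast and termOrPhrase are made up for illustration and do not appear in the patch.

```java
import java.util.List;
import java.util.Set;

/** Toy model of position-based phrase matching; NOT Lucene's actual API. */
public class CatenateSketch {

    /**
     * Returns true if there is a start offset such that, for every query
     * position i, the document token at start+i is one of the alternatives
     * allowed at position i. This loosely mimics a phrase query whose
     * positions may each hold several stacked terms.
     */
    public static boolean phraseMatches(List<Set<String>> queryPositions,
                                        List<String> docTokens) {
        for (int start = 0; start + queryPositions.size() <= docTokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < queryPositions.size(); i++) {
                if (!queryPositions.get(i).contains(docTokens.get(start + i))) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    /**
     * Assumed query shape for "PowerShot" with catenateWords="1": the
     * catenated word shares a position with the LAST generated part.
     */
    public static boolean stackedOnLast(List<String> docTokens) {
        List<Set<String>> query = List.of(Set.of("power"), Set.of("shot", "powershot"));
        return phraseMatches(query, docTokens);
    }

    /**
     * Assumed query shape the patch aims for: the catenated word as a
     * single term, OR the generated parts as a phrase.
     */
    public static boolean termOrPhrase(List<String> docTokens) {
        boolean term = docTokens.contains("powershot");
        boolean phrase = phraseMatches(List.of(Set.of("power"), Set.of("shot")), docTokens);
        return term || phrase;
    }

    public static void main(String[] args) {
        List<String> oneWord = List.of("powershot");       // indexed text "powershot"
        List<String> twoWords = List.of("power", "shot");  // indexed text "power shot"

        System.out.println(stackedOnLast(oneWord));   // false
        System.out.println(stackedOnLast(twoWords));  // true
        System.out.println(termOrPhrase(oneWord));    // true
        System.out.println(termOrPhrase(twoWords));   // true
    }
}
```

In this model the stacked form still demands "power" at the position before the catenated word, so a document containing only "powershot" cannot match. The OR form matches both documents.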



--
This message was sent by Atlassian JIRA
(v6.2#6252)
