Posted to solr-user@lucene.apache.org by Benjamin Higgins <bh...@seattletimes.com> on 2008/01/07 23:15:10 UTC
Problem with camelCase but not casing in general
Hi all, I'm using a mostly out-of-the-box install of Solr to search
through our code repositories. I've run into a funny problem where
searches for camelCased text don't return results unless the casing
matches exactly.
For example, a query for "getElementById" returns 364 results, but
"getelementbyid" returns 0.
There isn't a problem with all casings, though. For example, "function"
and "Function" both return the same number of results, as do
"FUNCTION" and "FUNCtion" (6,278 with my docs). However, "funcTION"
returns only a few results, and only where the word is actually split up
(e.g. "func tion")!
So it seems that something is tokenizing words wherever the case changes
in the middle of them!
How can I get this to stop?
Thanks!
Ben
Here's the definition for the text field type in my schema.xml:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Re: Problem with camelCase but not casing in general
Posted by Mike Klaas <mi...@gmail.com>.
On 7-Jan-08, at 3:21 PM, Benjamin Higgins wrote:
>> Well, he might want to split on punctuation.
>
> I do, so I just turned off splitOnCaseChange instead of removing
> WordDelimiterFilterFactory completely.
>
> It's looking good now!
>
>> The OP's problem might have to do with index/query-time analyzer
>> mismatch. We'd know more if he posted the schema definitions.
>
> I did post a portion of my schema in my original email. I think
> I'm OK
> there, since I don't recall fiddling with it any.
Ah, I see it now. How very odd that that was a problem given that
schema.
You also might want to consider turning off the stemming for code
search.
-Mike
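
[Editor's sketch] One way to read Mike's stemming suggestion: drop the
EnglishPorterFilterFactory line from the analyzer chain, so identifiers
are indexed as written (after lowercasing) rather than stemmed. The
fragment below is the index analyzer from the original post with only
that line removed. It is an untested illustration, not a configuration
anyone in the thread posted. It assumes the same line is also removed
from the query analyzer and that the documents are reindexed afterwards.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- assumption: EnglishPorterFilterFactory removed here to turn off stemming -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```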
RE: Problem with camelCase but not casing in general
Posted by Benjamin Higgins <bh...@seattletimes.com>.
> Well, he might want to split on punctuation.
I do, so I just turned off splitOnCaseChange instead of removing
WordDelimiterFilterFactory completely.
It's looking good now!
> The OP's problem might have to do with index/query-time analyzer
> mismatch. We'd know more if he posted the schema definitions.
I did post a portion of my schema in my original email. I think I'm OK
there, since I don't recall fiddling with it any.
Thanks everyone.
Ben
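
[Editor's sketch] The change Ben describes would presumably look like the
fragment below: the WordDelimiterFilterFactory stays in place, so tokens
are still split on punctuation, but splitOnCaseChange is set to "0". All
other attribute values are copied from the schema in the original post.
Ben does not say whether he changed the index analyzer, the query
analyzer, or both; changing both (and reindexing) is an assumption here.

```xml
<!-- index analyzer: assumed change, splitOnCaseChange 1 -> 0 -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>

<!-- query analyzer: assumed change, splitOnCaseChange 1 -> 0 -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
```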
Re: Problem with camelCase but not casing in general
Posted by Mike Klaas <mi...@gmail.com>.
On 7-Jan-08, at 2:35 PM, Yonik Seeley wrote:
>
> Anyway, if splits on capitalization changes is not desired, getting
> rid of the WordDelimiterFilter in both the index and query analyzers
> is the right thing to do.
>
Well, he might want to split on punctuation.
self.object.frobulation.method()
probably shouldn't be one token.
The OP's problem might have to do with index/query-time analyzer
mismatch. We'd know more if he posted the schema definitions.
-Mike
Re: Problem with camelCase but not casing in general
Posted by Yonik Seeley <yo...@apache.org>.
On Jan 7, 2008 5:26 PM, Brendan Grainger <br...@gmail.com> wrote:
> I think your problem is happening because splitOnCaseChange is 1 in
> your WordDelimiterFilterFactory:
>
> <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>
> So "getElementById" is tokenized to:
>
> (get,0,3)
> (Element,3,10)
> (By,10,12)
> (Id,12,14)
> (getElementById,0,14,posIncr=0)
>
> However getelementbyid is tokenized to:
>
> (getelementbyid,0,14)
>
> which wouldn't be a term in the index??
It would be a term in the index since both go through the lowercase filter.
Anyway, if splitting on capitalization changes is not desired, getting
rid of the WordDelimiterFilter in both the index and query analyzers
is the right thing to do.
-Yonik
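
[Editor's sketch] For illustration, the field type from the original post
with the WordDelimiterFilterFactory removed from both analyzers, as Yonik
suggests, might look like the fragment below. The remaining lines are
copied from Ben's schema and nothing else is changed. This is untested.
Note that with only the WhitespaceTokenizerFactory doing the splitting,
tokens are separated on whitespace alone, so text joined by punctuation
stays as a single token.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <!-- WordDelimiterFilterFactory removed (illustrative) -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <!-- WordDelimiterFilterFactory removed (illustrative) -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```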
Re: Problem with camelCase but not casing in general
Posted by Brendan Grainger <br...@gmail.com>.
I think your problem is happening because splitOnCaseChange is 1 in
your WordDelimiterFilterFactory:
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
So "getElementById" is tokenized to:
(get,0,3)
(Element,3,10)
(By,10,12)
(Id,12,14)
(getElementById,0,14,posIncr=0)
However getelementbyid is tokenized to:
(getelementbyid,0,14)
which wouldn't be a term in the index??
I'm sure someone who knows more about solr will answer, but maybe
that will help.
Re: Problem with camelCase but not casing in general
Posted by Yonik Seeley <yo...@apache.org>.
On Jan 7, 2008 5:15 PM, Benjamin Higgins <bh...@seattletimes.com> wrote:
> Hi all, I am using a mostly out-of-the-box install of Solr that I'm
> using to search through our code repositories. I've run into a funny
> problem where searches for text that is camelCased aren't returning
> results unless the casing is exactly the same.
>
> For example, a query for "getElementById" returns 364 results, but
> "getelementbyid" returns 0.
>
> There isn't a problem with all casings, though. For example, "function"
> and "Function" both return the same number of results, as does
> "FUNCTION" and "FUNCtion" (6,278 with my docs). However, "funcTION"
> returns only a few results--and it's where the word is actually split up
> (e.g. "func tion")!
>
> So it seems that something may be tokenizing words where casing appears
> in the middle of them!
>
> How can I get this to stop?
Remove WordDelimiterFilter.
It's funny though, since WordDelimiterFilter should not have caused
this to happen (a query of getelementbyid should have matched a doc
with getElementById).
-Yonik