Posted to solr-user@lucene.apache.org by Benjamin Higgins <bh...@seattletimes.com> on 2008/01/07 23:15:10 UTC

Problem with camelCase but not casing in general

Hi all, I am using a mostly out-of-the-box install of Solr to search
through our code repositories.  I've run into a funny problem where
searches for camelCased text return no results unless the casing
matches exactly.

For example, a query for "getElementById" returns 364 results, but
"getelementbyid" returns 0.

Not every casing is a problem, though.  For example, "function" and
"Function" return the same number of results, as do "FUNCTION" and
"FUNCtion" (6,278 with my docs).  However, "funcTION" returns only a few
results, and only where the word is actually split up in the text
(e.g. "func tion")!

So it seems that something may be splitting words into separate tokens
wherever the case changes in the middle of them!

How can I get this to stop?

Thanks!

Ben


Here's the definition for the text field type in my schema.xml:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Re: Problem with camelCase but not casing in general

Posted by Mike Klaas <mi...@gmail.com>.

On 7-Jan-08, at 3:21 PM, Benjamin Higgins wrote:

>> Well, he might want to split on punctuation.
>
> I do, so I just turned off splitOnCaseChange instead of removing
> WordDelimiterFilterFactory completely.
>
> It's looking good now!
>
>> The OP's problem might have to do with index/query-time analyzer
>> mismatch.  We'd know more if he posted the schema definitions.
>
> I did post a portion of my schema in my original email.  I think I'm OK
> there, since I don't recall fiddling with it any.

Ah, I see it now.  How very odd that that was a problem given that
schema.

You also might want to consider turning off the stemming for code
search.
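
A minimal sketch of what turning off stemming could look like, assuming the
index analyzer from the schema in the original post with splitOnCaseChange
already set to "0" as described in the quoted reply. The only other difference
is that the EnglishPorterFilterFactory line is left out; the same line would
presumably be dropped from the query analyzer so both sides stay consistent.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- EnglishPorterFilterFactory left out here, so tokens are indexed unstemmed -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

With the stemmer gone, protwords.txt is no longer referenced by this analyzer.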

-Mike

RE: Problem with camelCase but not casing in general

Posted by Benjamin Higgins <bh...@seattletimes.com>.

> Well, he might want to split on punctuation.

I do, so I just turned off splitOnCaseChange instead of removing
WordDelimiterFilterFactory completely.

It's looking good now!
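
For reference, a sketch of that change, assuming the filter lines from the
schema in the original post. Only splitOnCaseChange differs; every other
attribute value is copied unchanged from the posted index and query analyzers.

```xml
<!-- Index analyzer: same filter line as posted, with splitOnCaseChange set to "0" -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>

<!-- Query analyzer: same change, keeping its catenate settings at "0" -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
```

Because the index analyzer changes, documents that were already indexed need
to be reindexed before the new tokenization applies to them.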

> The OP's problem might have to do with index/query-time analyzer  
> mismatch.  We'd know more if he posted the schema definitions.

I did post a portion of my schema in my original email.  I think I'm OK
there, since I don't recall fiddling with it any.

Thanks everyone.

Ben

Re: Problem with camelCase but not casing in general

Posted by Mike Klaas <mi...@gmail.com>.

On 7-Jan-08, at 2:35 PM, Yonik Seeley wrote:
>
> Anyway, if splitting on capitalization changes is not desired, getting
> rid of the WordDelimiterFilter in both the index and query analyzers
> is the right thing to do.
>
Well, he might want to split on punctuation.

self.object.frobulation.method()

probably shouldn't be one token.

The OP's problem might have to do with index/query-time analyzer  
mismatch.  We'd know more if he posted the schema definitions.

-Mike

Re: Problem with camelCase but not casing in general

Posted by Yonik Seeley <yo...@apache.org>.

On Jan 7, 2008 5:26 PM, Brendan Grainger <br...@gmail.com> wrote:
> I think your problem is happening because splitOnCaseChange is 1 in
> your WordDelimiterFilterFactory:
>
> <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>
> So "getElementById" is tokenized to:
>
> (get,0,3)
> (Element,3,10)
> (By,10,12)
> (Id,12,14)
> (getElementById,0,14,posIncr=0)
>
> However getelementbyid is tokenized to:
>
> (getelementbyid,0,14)
>
> which wouldn't be a term in the index??

It would be a term in the index, since both go through the lowercase filter:
at index time the catenated token getElementById becomes getelementbyid.

Anyway, if splitting on capitalization changes is not desired, getting
rid of the WordDelimiterFilter in both the index and query analyzers
is the right thing to do.
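
A sketch of that option, assuming the analyzers from the schema in the
original post with nothing changed except that the WordDelimiterFilterFactory
line is removed from both:

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <!-- WordDelimiterFilterFactory removed: tokens are no longer split on case changes -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPorterFilterFactory"
          protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory"
          synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <!-- WordDelimiterFilterFactory removed here as well, to match the index side -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPorterFilterFactory"
          protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

With this chain, tokens are split on whitespace only, so text joined by
punctuation stays together as a single token.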

-Yonik

Re: Problem with camelCase but not casing in general

Posted by Brendan Grainger <br...@gmail.com>.

I think your problem is happening because splitOnCaseChange is 1 in  
your WordDelimiterFilterFactory:

<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

So "getElementById" is tokenized to:

(get,0,3)
(Element,3,10)
(By,10,12)
(Id,12,14)
(getElementById,0,14,posIncr=0)

However getelementbyid is tokenized to:

(getelementbyid,0,14)

which wouldn't be a term in the index??

I'm sure someone who knows more about Solr will answer, but maybe
that will help.

On Jan 7, 2008, at 5:15 PM, Benjamin Higgins wrote:

>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

Re: Problem with camelCase but not casing in general

Posted by Yonik Seeley <yo...@apache.org>.

On Jan 7, 2008 5:15 PM, Benjamin Higgins <bh...@seattletimes.com> wrote:
> Hi all, I am using a mostly out-of-the-box install of Solr to search
> through our code repositories.  I've run into a funny problem where
> searches for camelCased text return no results unless the casing
> matches exactly.
>
> For example, a query for "getElementById" returns 364 results, but
> "getelementbyid" returns 0.
>
> Not every casing is a problem, though.  For example, "function" and
> "Function" return the same number of results, as do "FUNCTION" and
> "FUNCtion" (6,278 with my docs).  However, "funcTION" returns only a few
> results, and only where the word is actually split up in the text
> (e.g. "func tion")!
>
> So it seems that something may be splitting words into separate tokens
> wherever the case changes in the middle of them!
>
> How can I get this to stop?

Remove WordDelimiterFilter.

It's funny though, since WordDelimiterFilter should not have caused
this to happen (a query of getelementbyid should have matched a doc
with getElementById).

-Yonik
