Posted to solr-user@lucene.apache.org by Lisa Riggle <li...@linkup.com> on 2011/07/01 17:20:47 UTC

non-alphanumeric character searching

Hi Everyone!

I'm very new to solr and have a question that I hope you all can answer.

My boss has me learning solr for work, with the specific goal of
improving the schema on one of the cores of our site.  This core
consists of nothing but company names from our database, so I think that
makes things easier, since there's no need to worry about parsing email
addresses or URLs or anything.

Anyway, I am running into some problems with non-alphanumeric characters
in company names causing searches to return wild results. For example,
there is 1 company in our database stored as /HPC Inter@ctive
(ApartmentGuide.com)/.  In my test script, I have a couple of different
search strings that don't seem to return consistent results.  For
example: /hpc inter@ctive/ returns 1 result (yay), but /hpc inter@ctive
(apartment/ and /hpc inter@ctive (apartmentguide/ both return 0
results.  /inter@ctive/ by itself returns 832 results.

Among other issues, I'm having a heck of a time trying to figure out how
to make solr just search for "inter@ctive" as a whole word instead of
splitting it up at the @ and searching for "inter" and "ctive".

How do I get solr to ignore special characters, like @, and just treat
it as part of the string?

I've spent some time trying out different tokenizers and filters, and
rearranging the order of some of the filters.  Doing that does affect
the results at times, but mostly I get the results listed above.  I also
tried using the PatternReplaceFilterFactory to just remove all special
characters from the index/search strings, but I'm fantastically bad at
regex, so that didn't work either.

I appreciate any and all advice.
Thanks!
--Lisa

----------

I'm running a default install of solr 3.2 with the following schema:

    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="[\'\.\-]" replacement="" replace="all" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0"
          catenateNumbers="0" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.PositionFilterFactory" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="[\'\.\-]" replacement="" replace="all" />
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0"
          catenateNumbers="0" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.PositionFilterFactory" />
      </analyzer>
    </fieldType>


I don't know what other information is needed to point me in the right
direction, but please let me know if there's something I can send that
will be of assistance.

Re: non-alphanumeric character searching

Posted by Erick Erickson <er...@gmail.com>.

I'd start by removing lots of stuff, particularly
WordDelimiterFilterFactory. That's splitting your input up by non-alpha
characters.

If you really want just the string stored, try just using KeywordTokenizer
and LowerCaseFilter (although ASCIIFolding... wouldn't hurt).
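
For illustration, a minimal sketch of what such a stripped-down field type might look like in schema.xml. The name "company_exact" is hypothetical and not part of the original schema; the factory classes are the same ones already used in the posted schema. Because KeywordTokenizerFactory emits the whole input as a single token, this only matches the full company name, and a multi-word query would generally need to be quoted as a phrase so the query parser does not split it on whitespace first.

```xml
<!-- Hypothetical field type name; not part of the original schema. -->
<fieldType name="company_exact" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer element is applied at both index and query time. -->
  <analyzer>
    <!-- Emits the entire input as one token, so "inter@ctive" is never split. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Optional: folds accented characters to their ASCII equivalents. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- Makes matching case-insensitive. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Running a sample company name through the admin/analysis page with this type should show a single lowercased token at each stage.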

But the best way to understand all the effects of various analysis chains is
to use the admin/analysis page (be sure to turn the verbose output on).
It'll show you exactly what produces what transformations. Another very
useful tool is adding &debugQuery=on to your URLs and looking at the parsed
output.

Oh, and I expect that some of your unexpected results come from
searching against your default field. Searching "example:stuff" will search
against the field "example". Searching "stuff" will search against the
<defaultSearchField> in your schema.xml (note that &debugQuery=on will show
this)....
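
As a rough sketch of how the default field is wired up in a Solr 3.x schema.xml: the field name "company_name" below is an assumption made up for this example, since the original post does not show its field definitions.

```xml
<!-- "company_name" is a hypothetical field name used only for illustration. -->
<!-- This element belongs inside the <fields> section of schema.xml. -->
<field name="company_name" type="text" indexed="true" stored="true"/>

<!-- Top-level element of <schema>: query terms without a "field:" prefix -->
<!-- are searched against this field. -->
<defaultSearchField>company_name</defaultSearchField>
```

With a setup like this, an unqualified query such as "stuff" goes to company_name, while "example:stuff" is parsed as a search against a field called "example".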

Hope this helps
Erick