Posted to solr-user@lucene.apache.org by Wade Leftwich <wa...@leftwich.us> on 2006/12/14 05:32:11 UTC

Case sensitivity on hostnames and email addresses

I've run into some unexpected case sensitivity on searches, at least
unexpected by me.

If you index a text field containing this sentence:

A sentence containing CamelCase words by UpAndDown@mysite.com is found
at StudlyCaps.org

The document will be found by searching for "camelcase" but not for
"upanddown@mysite.com" or "studlycaps.org".

This happens with the Standard or the DisMax query handler.

A bit of a problem for me, because I'm indexing a bunch of business
magazines, and domain names are frequently capitalized, often in CamelCase.

Is this maybe a bug? Or a WAD?

-- Wade Leftwich
Ithaca, NY


Re: Case sensitivity on hostnames and email addresses

Posted by Yonik Seeley <yo...@apache.org>.
Oh, and yet another way to get around it (with its own trade-offs) is
to use something like fieldtype textTight in the example schema.xml,
which catenates all word parts in both the index analyzer and the query
analyzer.

This would index UpAndDown@mysite.com as "upanddownmysitecom" and allow
the following queries to match:
"UpAndDown@mysite.com", "up-and-down@mysite/com", "upanddown@mysite.com"

The downside is that it would *not* allow "upanddown" or "UpAndDown" to match.
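
A rough way to picture the catenate-everything behaviour is the Python
sketch below. It is only a toy model, not Solr code: tight_key is a
made-up helper, and it ignores any other filters the real field type
may apply.

```python
import re

def tight_key(text):
    """Toy model: lowercase, then drop every non-alphanumeric character,
    so all word parts end up catenated into a single term."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

# The same function stands in for both the index and the query analyzer.
indexed = tight_key("UpAndDown@mysite.com")

for query in ["UpAndDown@mysite.com", "up-and-down@mysite/com",
              "upanddown@mysite.com", "upanddown"]:
    print(query, "->", tight_key(query) == indexed)
```

Only a query that reduces to exactly the same catenated term matches,
which is why a partial query such as "upanddown" is left out.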

-Yonik

Re: Case sensitivity on hostnames and email addresses

Posted by Yonik Seeley <yo...@apache.org>.
On 12/13/06, Wade Leftwich <wa...@leftwich.us> wrote:
> I've run into some unexpected case sensitivity on searches, at least
> unexpected by me.
>
> If you index a text field containing this sentence:
>
> A sentence containing CamelCase words by UpAndDown@mysite.com is found
> at StudlyCaps.org
>
> The document will be found by searching for "camelcase" but not for
> "upanddown@mysite.com" or "studlycaps.org".
>
> This happens with the Standard or the DisMax query handler.
>
> A bit of a problem for me, because I'm indexing a bunch of business
> magazines, and domain names are frequently capitalized, often in CamelCase.

It's your text analysis configuration.
The WordDelimiterFilter is doing this... it's so "CamelCase" can be
found searching for "camelcase", "camel-case" or "camel case".
It does this by detecting all the word parts and then indexing them
separately as well as all catenated.  So "CamelCase" is indexed as
both "camelcase" and "camel case".
When searching, the WordDelimiterFilter is configured to split only,
so "camelcase", "camel-case", and "camel case" will all match.
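
The index/query asymmetry described above can be modelled with a few
lines of Python. This is a simplified sketch, not the actual
WordDelimiterFilter: the function names are invented for illustration,
and it treats the index as a plain set of terms, ignoring token
positions and phrase matching.

```python
import re

def split_parts(token):
    """Split on non-alphanumerics and on lower-to-upper case changes,
    then lowercase each part."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        # "CamelCase" -> "Camel Case"
        spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", chunk)
        parts.extend(p.lower() for p in spaced.split())
    return parts

def index_terms(token):
    """Index side: the individual parts plus all parts catenated."""
    parts = split_parts(token)
    return set(parts) | {"".join(parts)}

def query_terms(token):
    """Query side: split only, no catenation."""
    return split_parts(token)

def matches(indexed_token, query):
    """A query matches if every one of its terms was indexed."""
    terms = index_terms(indexed_token)
    return all(t in terms for t in query_terms(query))

for q in ["camelcase", "camel-case", "camel case"]:
    print(q, "->", matches("CamelCase", q))

print(matches("UpAndDown@mysite.com", "upanddown@mysite.com"))
```

In this model the three CamelCase queries print True and the lowercase
address query prints False.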

When it hits something like UpAndDown@mysite.com, it would index it as
"upanddownmysitecom" and "up and down mysite com".
On the search side, a search of "upanddown@mysite.com" is broken into
"upanddown mysite com", which doesn't match anything indexed, because
"upanddown" on its own was never indexed.

There are a number of options, including but not limited to:
 - Create a new fieldtype and throw out the WordDelimiterFilter. The
   current "text" field type is for demonstration purposes only anyway.
   Solr, like Lucene, is meant to be customized.
 - If you want to keep the camel-case flexibility, but not across "."
   and "-", then try using a letter tokenizer to throw away the
   non-letter characters first.
 - Create a specific filter for email or website addresses if no
   combination of existing filters does what you want.
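
As a rough illustration of the last option, here is a Python sketch of
what an address-aware analysis step might do. Everything in it is an
assumption made for illustration: the EMAIL and HOSTNAME patterns are
deliberately naive, analyze and split_word are made-up names, and a real
implementation would be a custom analysis component inside Solr rather
than a script like this.

```python
import re

# Deliberately naive patterns, for illustration only.
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
HOSTNAME = re.compile(r"^[\w-]+(\.[A-Za-z][\w-]*)+$")

def split_word(token):
    """Split an ordinary word on case changes and non-alphanumerics."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", token)
    return [p.lower() for p in re.split(r"[^A-Za-z0-9]+", spaced) if p]

def analyze(text):
    """Keep addresses whole (lowercased); split everything else."""
    terms = []
    for raw in text.split():
        token = raw.strip(".,;:!?\"'()")
        if EMAIL.match(token) or HOSTNAME.match(token):
            terms.append(token.lower())
        else:
            terms.extend(split_word(token))
    return terms

doc = ("A sentence containing CamelCase words by "
       "UpAndDown@mysite.com is found at StudlyCaps.org")
print(analyze(doc))
print(analyze("upanddown@mysite.com"))
```

With the same step applied at index and query time, the lowercase
address and hostname queries produce terms that are present in the
indexed document. The trade-off is that a partial query such as
"mysite" would no longer match the address.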

Play around with the analysis tool on the admin page; it will help you
understand what's going on.

-Yonik