Posted to solr-user@lucene.apache.org by Mike Richmond <ri...@gmail.com> on 2006/06/21 20:50:28 UTC

Custom E-mail Tokenizer

I have created a custom e-mail tokenizer and am trying to make e-mail
addresses more searchable inside of Solr (without having to rely on
wildcard/prefix queries), but am running into a couple of problems using
it.

Given the e-mail address "java-user@lucene.apache.org", my tokenizer
produces the following tokens (this was discussed on the Java Lucene
users group and can be found here:
http://www.nabble.com/indexing-emails-t1800267.html#a4932444):
    java-user@lucene.apache.org
    java
    user
    java-user
    lucene.apache.org
    lucene
    apache.org
    org


I then added the following to my schema configuration:
    <fieldtype name="email" class="solr.StrField">
        <analyzer type="index">
            <tokenizer class="com.willetts.wmail.analysis.EmailTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldtype>


If I then fire up Solr and use the analysis tool from the admin page,
it seems to work exactly as I would expect (i.e. e-mail addresses that I
type in do get broken up into the correct tokens).  However, when I
add data to this index and then attempt to perform a search using the
search interface, I cannot get any matches.  For example, when I add
"richmondmike@gmail.com" to a field that has type "email" (see schema
configuration above), searches for the terms "richmondmike", "gmail",
or "gmail.com" do not match any of the results.


Do I need to use a custom fieldtype class as well instead of using
"solr.StrField"?  Any help would be greatly appreciated.


Thanks in advance,

Mike

Re: Custom E-mail Tokenizer

Posted by Mike Richmond <ri...@gmail.com>.
Worked like a champ.  Thanks for the quick reply.


--Mike


On 6/21/06, Chris Hostetter <ho...@fucit.org> wrote:
>
> :     <fieldtype name="email" class="solr.StrField">
> :         <analyzer type="index">
> :             <tokenizer class="com.willetts.wmail.analysis.EmailTokenizerFactory"/>
> :             <filter class="solr.LowerCaseFilterFactory"/>
> :         </analyzer>
> :     </fieldtype>
>
> Try changing the fieldtype class to solr.TextField ... I've never seen
> anyone try to use an analyzer with StrField (if you'd asked me before you
> tried it, I would have guessed the schema file wouldn't even have loaded
> properly)
>
>
> -Hoss
>
>

Re: Custom E-mail Tokenizer

Posted by Chris Hostetter <ho...@fucit.org>.
:     <fieldtype name="email" class="solr.StrField">
:         <analyzer type="index">
:             <tokenizer class="com.willetts.wmail.analysis.EmailTokenizerFactory"/>
:             <filter class="solr.LowerCaseFilterFactory"/>
:         </analyzer>
:     </fieldtype>

Try changing the fieldtype class to solr.TextField ... I've never seen
anyone try to use an analyzer with StrField (if you'd asked me before you
tried it, I would have guessed the schema file wouldn't even have loaded
properly)
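
For reference, a sketch of what that change might look like, assuming
everything else in the posted field type stays as it was. Only the class
attribute differs from the original; the tokenizer factory is the
poster's own class and is not part of Solr:

    <!-- sketch: same field type as posted, with StrField swapped for TextField -->
    <fieldtype name="email" class="solr.TextField">
        <analyzer type="index">
            <!-- custom factory from the original post (not shipped with Solr) -->
            <tokenizer class="com.willetts.wmail.analysis.EmailTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldtype>

Documents added under the old field type would presumably need to be
re-added after the schema change so that they are indexed with the
analyzer applied.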


-Hoss