Posted to solr-user@lucene.apache.org by Kai Gülzau <kg...@novomind.com> on 2013/05/17 18:26:44 UTC

StandardTokenizer vs. hyphens

Is there a StandardTokenizer implementation which does not break words on hyphens?

I think it would be more flexible to retain hyphens and use a WordDelimiterFilterFactory to split these tokens afterwards.


StandardTokenizer today:
doc1: email -> email
doc2: e-mail -> e|mail
doc3: e mail -> e|mail

query1: email -> doc1
query2: e-mail -> doc2,doc3
query3: e mail -> doc2,doc3


StandardTokenizer which keeps hyphens + WDF:
doc1: email -> email
doc2: e-mail -> e-mail|email|e|mail
doc3: e mail -> e|mail

query1: email -> doc1,doc2
query2: e-mail -> doc1,doc2,doc3
query3: e mail -> doc2,doc3


Any suggestions to configure or code the 2nd behavior?

Regards,

Kai Gülzau

Re: StandardTokenizer vs. hyphens

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/17/2013 10:26 AM, Kai Gülzau wrote:
> Is there some StandardTokenizer Implementation which does not break words on hyphens?
> 
> I think it would be more flexible to retain hyphens and use a WordDelimiterFactory to split these tokens.

You can use the whitespace tokenizer with WDF.  This is what I did for
my index up through 3.5.
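
A minimal sketch of that approach (field and type names are made up for
illustration; the WDF options shown are one reasonable combination, not
necessarily the exact ones I used). Whitespace tokenization keeps "e-mail"
as one token, then WDF splits it and, with preserveOriginal and
catenateWords enabled, also emits "e-mail" and "email":

```xml
<!-- Illustrative fieldType: whitespace tokenizer + WordDelimiterFilter.
     "e-mail" is indexed as e-mail | email | e | mail. -->
<fieldType name="text_ws_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal keeps "e-mail", catenateWords adds "email",
         generateWordParts adds "e" and "mail" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```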

In 4.x, I wanted to be able to use the CJK filters, but they don't work
with the whitespace tokenizer, only with the ICU or standard tokenizers.
Until recently, the ICU tokenizer was just as aggressive as the standard
one on punctuation, so it wouldn't work either.

Thanks to SOLR-4123, I was able to change tokenizers so I could still
use WDF.  This issue adds the capability to change how the ICU tokenizer
works via a rule file.  Here is my fieldType:

http://pastie.org/private/tjd9pk6sfgohyhpfbpn7q

The custom rule capability on the ICU tokenizer is in 4.1 or later, and
the Latin-break-only-on-whitespace.rbbi file that I am using in my
schema can be found in the Solr source code.
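
For reference, a sketch of what such a fieldType can look like (this is an
illustration assuming Solr 4.1+ with the ICU analysis-extras contrib on the
classpath, not a copy of my pastie above; the field name and WDF options are
placeholders). The rulefiles attribute maps a script code to an RBBI
break-rules file:

```xml
<!-- Illustrative fieldType: ICU tokenizer with custom break rules (SOLR-4123)
     so Latin-script text only breaks on whitespace, then WDF handles hyphens. -->
<fieldType name="text_icu_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- "Latn" is the ISO script code; the .rbbi file ships in the Solr source -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```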

https://issues.apache.org/jira/browse/SOLR-4123

Thanks,
Shawn