Posted to solr-user@lucene.apache.org by Nathan Tallman <nt...@gmail.com> on 2012/11/15 18:38:20 UTC

Synonym/Tokenizer for Hyphenated Words

Hello Solr users,

I use Solr 3.5 via VuFind 1.3 and am having a problem with a synonym. No
matter what syntax I use, it doesn't seem to have an effect. (See various
combinations below.)

antisemitism,anti-semitism,Antisemitism,Anti-Semitism,Anti-semitism,anti-Semitism

antisemitism,anti\-semitism,Antisemitism,Anti\-Semitism,Anti\-semitism,anti\-Semitism

antisemitism,anti semitism,Antisemitism,Anti Semitism,Anti semitism,anti Semitism

It was suggested to me that this was not a synonym issue, but a tokenizing
issue, because anti-semitism was being interpreted as anti semitism.

Does anyone have any suggestions for making the synonym work? Tweaking the
tokenizer in schema.xml? Or somehow escaping the hyphen in synonyms.txt?

Many thanks,
Nathan

Re: Synonym/Tokenizer for Hyphenated Words

Posted by Erick Erickson <er...@gmail.com>.
What does "having a problem" mean? At index time? At query time?

But your problem is most likely the tokenizer as you suspect. Try something
like WhitespaceTokenizer and build up from there.

Three friends:
1> admin/analysis page
2> admin/schema-browser
3> &debugQuery=on
The first will show you what happens to tokens _after_ they get through
the tokenization. Be aware that this probably isn't entirely helpful when
your problem is in the tokenization step.

The second shows you what terms are actually in your index.

The third shows you what your parsed query looks like.

Couple of other things:
1> there's no need to put in all the capitalization forms _if_ you put
LowerCaseFilter in front of your synonyms filter.
2> WhitespaceTokenizer is pretty simple. For instance, punctuation will be
part of the tokens (e.g. periods at the end of sentences). So it's a place
to _start_ but you'll have to think about what you really want from your
tokenization process before deciding.
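
As a starting point, the two suggestions above (whitespace tokenization, then
lowercasing ahead of the synonym filter) might look roughly like the following
field type in schema.xml. This is only a sketch under assumptions: the field
type name text_ws_syn is made up for illustration, synonyms.txt is the file
mentioned earlier in the thread, and the analyzer in a VuFind schema will
likely contain other filters that are not shown here.

```xml
<!-- Hypothetical field type; the name "text_ws_syn" is an assumption. -->
<fieldType name="text_ws_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits on whitespace only, so "anti-semitism" stays a single token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercases before synonym matching, so one lowercase entry per form is enough. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Reads synonyms.txt; expand="true" maps each listed form to all of the others. -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

With a chain like this, the synonyms.txt entry could probably shrink to
something like "antisemitism,anti-semitism". Changing index-time analysis only
affects documents indexed afterwards, so a reindex would be needed before
judging the result.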

Best
Erick

