Posted to java-user@lucene.apache.org by Stephane Nicoll <st...@gmail.com> on 2013/11/21 15:42:24 UTC
tokenizer to strip a set of characters
Hi,
I am using Lucene 3.6 and I am looking for a tokenizer that would remove
certain characters when they are present at the beginning or at the end of
a token.
I initially used the StandardAnalyzer and switched to the
WhitespaceAnalyzer because the former was too aggressive for my use case.
A few examples:
- foo, -> foo (comma at the end)
- foo. -> foo (period at the end)
- foo!!!! -> foo
- foo?! -> foo
- ,foo -> foo (a comma at the beginning of a word is a typo but
should be handled)
Is there a configurable tokenizer I could use for this?
Thanks,
S.
Re: tokenizer to strip a set of characters
Posted by Jack Krupansky <ja...@basetechnology.com>.
The word delimiter filter accepts a table that specifies the type of
each character:
http://lucene.apache.org/core/4_5_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
http://lucene.apache.org/core/4_5_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html#WordDelimiterFilter(org.apache.lucene.analysis.TokenStream, byte[], int, org.apache.lucene.analysis.util.CharArraySet)
There is also a regex token filter that you could use to make fine
adjustments, such as characters that are allowed within tokens but ignored
at the start or end.
-- Jack Krupansky
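To make the regex suggestion concrete, here is a minimal sketch in plain Java (java.util.regex only, no Lucene classes) of a pattern that strips a configurable set of characters from the start and end of a whitespace-delimited token while leaving them alone inside the token. The class name EdgeStripper and its methods are hypothetical and exist only for illustration. A pattern of this shape could probably be passed to a regex-based token filter such as PatternReplaceFilter in Lucene 4.x, but that wiring is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: strips a set of characters
// from both edges of each whitespace-delimited token.
final class EdgeStripper {
    private final Pattern edges;

    EdgeStripper(String charsToStrip) {
        if (charsToStrip.isEmpty()) {
            throw new IllegalArgumentException("charsToStrip must not be empty");
        }
        // Build a character class such as [\,\.\!\?]+ ; a backslash before a
        // non-alphanumeric character is always a literal escape in Java.
        StringBuilder cls = new StringBuilder("[");
        for (char c : charsToStrip.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                cls.append('\\');
            }
            cls.append(c);
        }
        cls.append("]+");
        // Match a run of those characters at the start OR at the end only.
        edges = Pattern.compile("^" + cls + "|" + cls + "$");
    }

    // Removes leading and trailing runs; characters inside the token stay.
    String strip(String token) {
        return edges.matcher(token).replaceAll("");
    }

    // Splits on whitespace, strips each token, and drops tokens that
    // consisted only of stripped characters.
    List<String> tokenize(String text) {
        List<String> out = new ArrayList<String>();
        for (String raw : text.trim().split("\\s+")) {
            String t = strip(raw);
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        EdgeStripper s = new EdgeStripper(",.!?");
        // prints [foo, foo, foo, foo, foo]
        System.out.println(s.tokenize("foo, foo. foo!!!! foo?! ,foo"));
    }
}
```

With the set ",.!?" this covers the examples from the original question, and a token such as "a.b" is left unchanged because the period is not at an edge.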