Posted to dev@lucene.apache.org by Earwin Burrfoot <ea...@gmail.com> on 2009/06/03 18:20:52 UTC

Re: Enhance StandardTokenizer to support words which will not be tokenized

I'm not sure you can easily marry a generated JFlex grammar with a
runtime-provided list of protected words.
I took the approach of emitting tokens for punctuation inside my
tokenizer and then, in a TokenFilter, either gluing them to adjacent text
tokens or dropping them from the stream.
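
A minimal sketch of that two-stage idea, written in plain Java rather than
against the Lucene TokenStream API so that it stays self-contained. Every
name here (ProtectedWordGlue, RawToken, rawTokens, analyze) is hypothetical,
and two rules are assumptions made for illustration: tokens are glued only
when they touch with no whitespace between them, and the longest protected
match wins.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; none of these names are Lucene APIs. */
final class ProtectedWordGlue {

    /** A raw token: a run of letters/digits, or one punctuation character. */
    static final class RawToken {
        final int start;
        final int end;
        final boolean punctuation;

        RawToken(int start, int end, boolean punctuation) {
            this.start = start;
            this.end = end;
            this.punctuation = punctuation;
        }
    }

    /** Stage 1: the "tokenizer" keeps punctuation as tokens instead of discarding it. */
    static List<RawToken> rawTokens(String text) {
        List<RawToken> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                out.add(new RawToken(start, i, false));
            } else {
                out.add(new RawToken(i, i + 1, true));
                i++;
            }
        }
        return out;
    }

    /** Stage 2: the "filter" glues touching tokens that form a protected word, else drops punctuation. */
    static List<String> analyze(String text, Set<String> protectedWords) {
        Set<String> lookup = new HashSet<>();
        int maxLen = 0;
        for (String w : protectedWords) {
            lookup.add(w.toLowerCase(Locale.ROOT)); // matching is case-insensitive
            maxLen = Math.max(maxLen, w.length());
        }
        List<RawToken> raw = rawTokens(text);
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < raw.size()) {
            RawToken first = raw.get(i);
            int best = -1; // index of the last token in the longest protected match
            for (int j = i; j < raw.size(); j++) {
                RawToken t = raw.get(j);
                if (j > i && t.start != raw.get(j - 1).end) {
                    break; // whitespace gap: the tokens do not touch
                }
                if (t.end - first.start > maxLen) {
                    break; // already longer than any protected word
                }
                String candidate = text.substring(first.start, t.end).toLowerCase(Locale.ROOT);
                if (lookup.contains(candidate)) {
                    best = j;
                }
            }
            if (best >= 0) {
                out.add(text.substring(first.start, raw.get(best).end)); // glued token, original case
                i = best + 1;
            } else {
                if (!first.punctuation) {
                    out.add(text.substring(first.start, first.end)); // ordinary text token
                }
                i++; // unmatched punctuation is simply dropped
            }
        }
        return out;
    }
}
```

With the protected words "c++", "c#" and ".net", calling
ProtectedWordGlue.analyze("I know C++, C# and .NET well.", words) keeps
"C++", "C#" and ".NET" whole and drops the comma and the final period. In a
real analyzer the second stage would live in a TokenFilter and would also
have to maintain offsets and position increments, which this sketch ignores.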

On Wed, Jun 3, 2009 at 20:10, Grant Ingersoll <gs...@apache.org> wrote:
> You'd have to modify the JFlex grammar.  I'd suggest adding in a generic
> "protected words" approach whereby you can pass in a list of protected
> words.
>
> This would be a nice patch/improvement.
>
> -Grant
>
> On Jun 3, 2009, at 4:07 AM, ami dudu wrote:
>
>>
>> Hi, I'm using a StandardTokenizer, which does a great job for me, but I
>> need to enhance it somehow so that words like "c++", "c#" and ".net" are
>> kept as is and not tokenized into "c" or "net".
>> I know that there are other tokenizers such as KeywordTokenizer and
>> WhitespaceTokenizer, but they do not include the StandardTokenizer logic.
>> Any ideas on the best way to add this enhancement?
>>
>> Thanks,
>> Amid
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
