Posted to dev@lucene.apache.org by ami dudu <am...@gmail.com> on 2009/06/03 13:07:18 UTC

Enhance StandardTokenizer to support words which will not be tokenized

Hi, I'm using StandardTokenizer, which does a great job for me, but I need to
enhance it somehow so that it treats words like "c++", "c#", and ".net" as-is
instead of tokenizing them into "c" or "net".
I know there are other tokenizers, such as KeywordTokenizer and
WhitespaceTokenizer, but they don't include the StandardTokenizer logic.
Any ideas on the best way to add this enhancement?
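
For reference, a minimal sketch of the behaviour (attribute-based TokenStream
API; depending on the Lucene version, StandardAnalyzer's constructor may also
need a Version argument):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShowStandardTokens {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    TokenStream ts = analyzer.tokenStream("body", new StringReader("c++ c# .net"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString()); // prints "c", "c", "net" -- the punctuation is gone
    }
    ts.end();
    ts.close();
    analyzer.close();
  }
}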

Thanks,
Amid
-- 
View this message in context: http://www.nabble.com/Enhance-StandardTokenizer-to-support-words-which-will-not-be-tokenized-tp23849495p23849495.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Enhance StandardTokenizer to support words which will not be tokenized

Posted by ami dudu <am...@gmail.com>.
This could be a good solution, but it would have to be maintained across every
update of the StandardAnalyzer rules.
Is there a way to work around that?
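
For instance, would rewriting those terms before tokenization with a char
filter be an acceptable workaround? A rough sketch (MappingCharFilter and
NormalizeCharMap come from Lucene releases later than this thread; the mapping
is purely textual and case-sensitive, and the same chain has to be applied at
query time):

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ProtectedTermsDemo {
  public static void main(String[] args) throws Exception {
    // Rewrite the protected terms into single-token placeholders before the
    // tokenizer ever sees them (purely textual, case-sensitive matching).
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("c++", "cplusplus");
    builder.add("c#", "csharp");
    builder.add(".net", "dotnet");
    Reader mapped = new MappingCharFilter(builder.build(),
        new StringReader("experience with c++, c# and .net"));

    StandardTokenizer tokenizer = new StandardTokenizer(); // older releases take the Reader here
    tokenizer.setReader(mapped);
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // experience, with, cplusplus, csharp, and, dotnet
    }
    tokenizer.end();
    tokenizer.close();
  }
}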


Grant Ingersoll-6 wrote:
> 
> You'd have to modify the JFlex grammar.  I'd suggest adding in a  
> generic "protected words" approach whereby you can pass in a list of  
> protected words.
> 
> This would be a nice patch/improvement.
> 
> -Grant

-- 
View this message in context: http://www.nabble.com/Enhance-StandardTokenizer-to-support-words-which-will-not-be-tokenized-tp23849495p23857450.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.




Re: Enhance StandardTokenizer to support words which will not be tokenized

Posted by Earwin Burrfoot <ea...@gmail.com>.
Not sure you can easily marry a generated JFlex grammar with a
runtime-provided list of protected words.
I took the approach of creating tokens for punctuation inside my
tokenizer and later gluing them onto nearby text tokens, or dropping
them from the stream, with a TokenFilter.
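
A rough sketch of such a filter, much simplified: it assumes the tokenizer
emits each punctuation run (e.g. "++" or "#") as a single token, it only glues
trailing punctuation as in "c++" or "c#", it ignores position increments and
token types, and it uses the newer attribute-based API.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public final class GluePunctuationFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  // The last word token is buffered until we know whether punctuation follows it.
  private String pendingTerm;
  private int pendingStart, pendingEnd;
  private boolean exhausted;

  public GluePunctuationFilter(TokenStream input) {
    super(input);
  }

  private static boolean isPunctuation(CharSequence s) {
    for (int i = 0; i < s.length(); i++) {
      if (Character.isLetterOrDigit(s.charAt(i))) return false;
    }
    return s.length() > 0;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (!exhausted && input.incrementToken()) {
      String term = termAtt.toString();
      int start = offsetAtt.startOffset();
      int end = offsetAtt.endOffset();
      if (isPunctuation(term)) {
        if (pendingTerm != null && start == pendingEnd) {
          // Punctuation directly follows the buffered word: glue, e.g. "c" + "++" -> "c++".
          termAtt.setEmpty().append(pendingTerm).append(term);
          offsetAtt.setOffset(pendingStart, end);
          pendingTerm = null;
          return true;
        }
        continue; // stand-alone punctuation: drop it
      }
      if (pendingTerm != null) {
        // A new word arrived: emit the buffered one, then buffer the new one.
        termAtt.setEmpty().append(pendingTerm);
        offsetAtt.setOffset(pendingStart, pendingEnd);
        pendingTerm = term;
        pendingStart = start;
        pendingEnd = end;
        return true;
      }
      pendingTerm = term;
      pendingStart = start;
      pendingEnd = end;
    }
    exhausted = true;
    if (pendingTerm != null) { // flush the last buffered word
      termAtt.setEmpty().append(pendingTerm);
      offsetAtt.setOffset(pendingStart, pendingEnd);
      pendingTerm = null;
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingTerm = null;
    exhausted = false;
  }
}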

On Wed, Jun 3, 2009 at 20:10, Grant Ingersoll <gs...@apache.org> wrote:
> You'd have to modify the JFlex grammar.  I'd suggest adding in a generic
> "protected words" approach whereby you can pass in a list of protected
> words.
>
> This would be a nice patch/improvement.
>
> -Grant



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785



Re: Enhance StandardTokenizer to support words which will not be tokenized

Posted by Grant Ingersoll <gs...@apache.org>.
You'd have to modify the JFlex grammar.  I'd suggest adding in a  
generic "protected words" approach whereby you can pass in a list of  
protected words.

This would be a nice patch/improvement.

-Grant


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

