
Posted to java-user@lucene.apache.org by Bhavin Pandya <bh...@rediff.co.in> on 2007/09/17 16:55:48 UTC

How to tokenize with comma in standard tokenizer

Hi,

The standard tokenizer works pretty well for me, but I have run into one problem with my usage.

I want to tokenize "TheRing6,Proposal6,GuyandGirl6" as three separate tokens, but the standard analyzer treats it as a single word because the token contains a digit.

Expected three tokens:
1. thering6
2. proposal6
3. guyandgirl6

I want to change this behaviour of the standard tokenizer, but I don't know where to make the change.
Do I need to comment out a rule in the StandardTokenizer.jj file? I find this file confusing.

Any pointers?

- Bhavin

Re: How to tokenize with comma in standard tokenizer

Posted by Bhavin Pandya <bh...@rediff.co.in>.

Thanks, Mark.

> Take the comma out of: | <#P: ("_"|"-"|"/"|"."|",") > in the .jj file

It's working for me.

- Bhavin Pandya




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: How to tokenize with comma in standard tokenizer

Posted by Mark Miller <ma...@gmail.com>.

Take the comma out of: | <#P: ("_"|"-"|"/"|"."|",") > in the .jj file 
(around line 92). Keep in mind that this will affect your ability to 
find tokens that were previously indexed with the comma in place, unless 
you reindex them. I believe the javacc target in the build file will 
regenerate the tokenizer. You need to get JavaCC and put a properties 
file called build.properties next to the build file, containing: 
javacc.home=c:/javacc (or wherever you put JavaCC).
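
As a rough sketch of the two edits described above: the grammar line 
below is simply the quoted rule with the "," alternative dropped, and 
the path is the same placeholder as above, to be replaced with your own 
JavaCC location. The exact rule and its position may differ in your copy 
of StandardTokenizer.jj, so check before editing.

```
| <#P: ("_"|"-"|"/"|".") >
```

```
javacc.home=c:/javacc
```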

Also, you could consider pre-processing the strings before analysis 
(for example, replacing each comma with a space).
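
A minimal sketch of that pre-processing idea, in plain Java with no 
Lucene dependency. The class and method names (CommaPreprocessor, 
replaceCommas, roughTokens) are made up for illustration, and 
roughTokens is only a stand-in for the analyzer (whitespace split plus 
lowercasing), not the real StandardAnalyzer behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CommaPreprocessor {

    // Replace every comma with a space so that a downstream tokenizer
    // sees the comma-separated parts as separate words.
    public static String replaceCommas(String text) {
        return text.replace(',', ' ');
    }

    // Stand-in for the analyzer (an assumption for this sketch):
    // split on whitespace and lowercase each part.
    public static List<String> roughTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : replaceCommas(text).trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [thering6, proposal6, guyandgirl6]
        System.out.println(roughTokens("TheRing6,Proposal6,GuyandGirl6"));
    }
}
```

In practice you would pass the result of replaceCommas to your existing 
analyzer instead of using roughTokens. Note that this replaces every 
comma, so a value such as 1,000 would be split as well. The same 
replacement would have to be applied to query text so that index and 
search stay consistent.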

- Mark

