Posted to java-user@lucene.apache.org by Jeff <je...@gmail.com> on 2007/07/12 15:40:21 UTC

replace values in index

I have documents with lots of text. Part of the text is in the following
format:

word1,word2,word3,word4,word5

I am currently using the StandardAnalyzer and everything is working great
with the other data, except I can't query for 'word3' because ',' isn't
treated as a token separator. Is there an easy way to add ',' as a token
separator?

Thanks,

-Jeff

Re: replace values in index

Posted by Mark Miller <ma...@gmail.com>.
While it is possible to alter the StandardAnalyzer, depending on more 
details of your source text, it may be better to use a different 
analyzer or make your own. The StandardAnalyzer is quite slow if you do 
not need all of its features, and modifying it will make it harder to 
keep up with bug fixes or improvements.
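
To make the "make your own" option concrete, here is a minimal sketch in plain Java of the splitting behaviour such a tokenizer would have. It deliberately uses no Lucene classes; the class name SimpleSplitSketch and the method tokenize are hypothetical names chosen for illustration. In Lucene itself the equivalent logic would most likely live in a CharTokenizer subclass that overrides isTokenChar, but that wiring is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for a custom tokenizer. It does not use any Lucene classes.
final class SimpleSplitSketch {

    // Splits on every character that is not a letter or digit, so ',' always
    // separates tokens, and lowercases each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A separator ends the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("word1,word2,word3,word4,word5"));
    }
}
```

The trade-off is that this also breaks apart things StandardAnalyzer tries to keep whole, such as "1,000", host names, and e-mail addresses.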

That said, StandardAnalyzer does normally split on commas, so you might 
want to check into what's really going on.

I suspect that 'word1,word2,word3,word4,word5' is being recognized as a 
NUM by StandardAnalyzer. A NUM match will keep a comma-delimited list 
intact as long as every other word contains a digit.

You might alter the <#P regular expression in StandardTokenizer.jj (the 
grammar behind StandardAnalyzer) by taking out the ','. This will take out 
certain NUM matches (like the match you're getting <g>), so commas will 
stop gluing your words together into a single token.
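
The NUM behaviour described above can be approximated with a plain java.util.regex sketch. This is a rough imitation, not the actual JavaCC grammar: the class name NumRuleSketch, the ASCII-only character classes, and the whole-string matching are all simplifying assumptions (the real tokenizer looks for the longest match at each position and handles far more than ASCII). The punctuation set is passed in as a regex character class so the effect of dropping ',' is visible.

```java
import java.util.regex.Pattern;

// Rough approximation of the NUM rule, for illustration only. Not Lucene code.
final class NumRuleSketch {
    // Any run of ASCII letters and digits.
    private static final String ALPHANUM = "[A-Za-z0-9]+";
    // A run of ASCII letters and digits containing at least one digit.
    private static final String HAS_DIGIT = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";

    // Builds a pattern in which segments joined by punctuation alternate so
    // that every other segment contains a digit.
    static Pattern numPattern(String punctClass) {
        String a = ALPHANUM, d = HAS_DIGIT, p = punctClass;
        String startsPlain = a + p + d + "(?:" + p + a + p + d + ")*(?:" + p + a + ")?";
        String startsDigit = d + p + a + "(?:" + p + d + p + a + ")*(?:" + p + d + ")?";
        return Pattern.compile("(?:" + startsPlain + ")|(?:" + startsDigit + ")");
    }

    // True if the whole input would be kept together as one NUM-like token.
    static boolean isSingleNum(String input, String punctClass) {
        return numPattern(punctClass).matcher(input).matches();
    }

    public static void main(String[] args) {
        String withComma = "[_/.,-]";   // punctuation set including ','
        String withoutComma = "[_/.-]"; // same set with ',' taken out
        System.out.println(isSingleNum("word1,word2,word3,word4,word5", withComma));
        System.out.println(isSingleNum("apple,banana,cherry", withComma));
        System.out.println(isSingleNum("word1,word2,word3,word4,word5", withoutComma));
    }
}
```

Under these assumptions, the digit-bearing list is kept as one token only while ',' is in the punctuation set, and a list with no digits is never glued together. Removing ',' has the same side effect on real numbers written like "1,000", which would then be split as well.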
 
- Mark
