Posted to java-user@lucene.apache.org by Paul Libbrecht <pa...@activemath.org> on 2009/03/14 23:36:28 UTC

underscore a word separator in StandardAnalyzer?

Hello fellows of Lucene,

I just discovered that the _ character is a word separator in the
StandardAnalyzer.
Can that be?
It broke our usage of a field that stores a comma-separated list of
"uri-fragments" which, of course, contain _: the StandardAnalyzer
splits these into separate terms, which makes the search far too fuzzy.

Is there any rationale? A past debate about that?
I would think my naive view is rather common: an underscore makes
new words out of existing words, while a dash makes compound words.

I know I could try to adapt StandardAnalyzer! I just wanted to know the
reasons.

paul

Re: underscore a word separator in StandardAnalyzer?

Posted by Paul Libbrecht <pa...@activemath.org>.
Sure, all this is possible, and I would know how to write my own analyzer.
I just ran into this in an existing solution that uses StandardAnalyzer
for almost everything because it's generic. That is why I wanted to know
if there was a rationale.
Of course, URIs have trivial analyzers.

paul


On 16 March 2009, at 00:03, Daniel Noll wrote:

> Paul Libbrecht wrote:
>> Hello fellows of Lucene,
>> I just discovered that the _ character is a word separator in the  
>> StandardAnalyzer.
>> Can it be?
>> It broke our usage of a field that stores a comma-separated list of  
>> "uri-fragments"
>
> If I were analysing a URI, I would not be using StandardAnalyser,  
> but something that splits only on what is special for a URI.  You  
> wouldn't even want to break on a hyphen, normally.
>
> In your case, you are breaking it up already so you could just make  
> that your analyser.  Or if you want to keep breaking it up before it  
> gets put into Lucene, wouldn't a trivial analyser which breaks on  
> commas be the way to go?


Re: underscore a word separator in StandardAnalyzer?

Posted by Daniel Noll <da...@nuix.com>.
Paul Libbrecht wrote:
> 
> Hello fellows of Lucene,
> 
> I just discovered that the _ character is a word separator in the 
> StandardAnalyzer.
> Can it be?
> It broke our usage of a field that stores a comma-separated list of 
> "uri-fragments"

If I were analysing a URI, I would not be using StandardAnalyser, but 
something that splits only on what is special for a URI.  You wouldn't 
even want to break on a hyphen, normally.

In your case, you are breaking it up already so you could just make that 
your analyser.  Or if you want to keep breaking it up before it gets put 
into Lucene, wouldn't a trivial analyser which breaks on commas be the 
way to go?
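
As a rough illustration of the comma-only splitting suggested above, here is a minimal plain-Java sketch. It does not use the Lucene API: the class name UriFragmentTokens, both method names, and the sample field value are hypothetical, and splitOnNonAlphanumerics is only a simplified stand-in for a tokenizer that treats _ and - as separators, not StandardAnalyzer's actual grammar.

```java
import java.util.ArrayList;
import java.util.List;

public class UriFragmentTokens {

    /** Splits only on commas, trimming whitespace and skipping empty entries. */
    public static List<String> splitOnCommas(String fieldValue) {
        List<String> terms = new ArrayList<>();
        for (String part : fieldValue.split(",")) {
            String term = part.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    /**
     * Simplified stand-in (an assumption, not Lucene's real rules):
     * splits on every run of characters that are neither letters nor digits,
     * so underscores and hyphens both act as separators.
     */
    public static List<String> splitOnNonAlphanumerics(String fieldValue) {
        List<String> terms = new ArrayList<>();
        for (String part : fieldValue.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Hypothetical field value: a comma-separated list of uri-fragments.
        String field = "intro_to_sets, proof-by-induction";
        System.out.println(splitOnCommas(field));
        System.out.println(splitOnNonAlphanumerics(field));
    }
}
```

With comma-only splitting each uri-fragment stays a single term, so a search has to match the whole fragment; the second method shows how the same value falls apart into many short terms once _ and - are treated as separators. In a real index the comma-splitting logic would have to be wrapped in a custom Lucene analyzer.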

Daniel


-- 
Daniel Noll                            Forensic and eDiscovery Software
Senior Developer                              The world's most advanced
Nuix                                                email data analysis
http://nuix.com/                                and eDiscovery software
