Posted to java-user@lucene.apache.org by Paul Libbrecht <pa...@activemath.org> on 2009/03/14 23:36:28 UTC
underscore a word separator in StandardAnalyzer?
Hello fellows of Lucene,
I just discovered that the _ character is a word separator in the
StandardAnalyzer.
Can it be?
It broke our usage of a field that stores a comma-separated list of
"uri-fragments" which, of course, contain _: the StandardAnalyzer
splits these into separate terms, which makes the search hopelessly fuzzy.
Is there any rationale? A past debate about that?
I would think my naive expectation is rather common: an underscore joins
existing words into a new word, while a dash makes compound words.
I sure know I can try to adapt standard-analyzer! I wanted to know the
reasons.
paul
Re: underscore a word separator in StandardAnalyzer?
Posted by Paul Libbrecht <pa...@activemath.org>.
Sure, all this is possible, and I would know how to write my own analyzer.
I just faced this on an existing solution which takes StandardAnalyzer
for almost everything because it's generic. Therefore I wanted to know
if there was a rationale.
Of course, URIs have trivial analyzers.
paul
On 16 Mar 2009, at 00:03, Daniel Noll wrote:
> Paul Libbrecht wrote:
>> Hello fellows of Lucene,
>> I just discovered that the _ character is a word separator in the
>> StandardAnalyzer.
>> Can it be?
>> It broke our usage of a field that stores a comma-separated list of
>> "uri-fragments"
>
> If I were analysing a URI, I would not be using StandardAnalyser,
> but something that splits only on what is special for a URI. You
> wouldn't even want to break on a hyphen, normally.
>
> In your case, you are breaking it up already so you could just make
> that your analyser. Or if you want to keep breaking it up before it
> gets put into Lucene, wouldn't a trivial analyser which breaks on
> commas be the way to go?
Re: underscore a word separator in StandardAnalyzer?
Posted by Daniel Noll <da...@nuix.com>.
Paul Libbrecht wrote:
>
> Hello fellows of Lucene,
>
> I just discovered that the _ character is a word separator in the
> StandardAnalyzer.
> Can it be?
> It broke our usage of a field that stores a comma-separated list of
> "uri-fragments"
If I were analysing a URI, I would not be using StandardAnalyser, but
something that splits only on what is special for a URI. You wouldn't
even want to break on a hyphen, normally.
In your case, you are breaking it up already so you could just make that
your analyser. Or if you want to keep breaking it up before it gets put
into Lucene, wouldn't a trivial analyser which breaks on commas be the
way to go?
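As a minimal sketch of that idea, here is the splitting logic in plain Java. This is not Lucene's Analyzer API, and it only approximates what StandardAnalyzer does: the class name, method names, and sample field value are all made up for illustration. Splitting on commas alone keeps underscores and hyphens inside each term, whereas splitting on every non-alphanumeric character fragments them:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, independent of Lucene; it only illustrates the tokenisation.
final class CommaTokenizerSketch {

    // Split a field value on commas only, so '_' and '-' stay inside each token.
    static List<String> splitOnCommas(String fieldValue) {
        List<String> tokens = new ArrayList<>();
        for (String part : fieldValue.split(",")) {
            String token = part.trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // For contrast: split on every run of characters that are not letters or
    // digits, a rough stand-in for treating '_' and '-' as word separators.
    static List<String> splitOnNonAlphanumerics(String fieldValue) {
        List<String> tokens = new ArrayList<>();
        for (String part : fieldValue.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String value = "intro_to_sets,real-numbers"; // assumed sample field value
        System.out.println(splitOnCommas(value));           // [intro_to_sets, real-numbers]
        System.out.println(splitOnNonAlphanumerics(value)); // [intro, to, sets, real, numbers]
    }
}
```

In a real index the same comma-only rule would have to be wrapped in a custom tokenizer and analyzer for that one field, with StandardAnalyzer kept for the others.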
Daniel
--
Daniel Noll
Senior Developer
Nuix
http://nuix.com/
Forensic and eDiscovery Software: The world's most advanced email data analysis and eDiscovery software