Posted to java-user@lucene.apache.org by Marcus Rau <ma...@meta-level.de> on 2004/07/29 12:47:36 UTC
Allow non letter characters in tokens
Hi there,
my question is a pretty short one!
How can I prevent Lucene from cutting out special characters (e.g. the
"_") during tokenization of a text? It's quite essential for me to have
some non-letter chars in my index.
Regards
Marcus
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
RE: Allow non letter characters in tokens
Posted by Otis Gospodnetic <ot...@yahoo.com>.
WhitespaceAnalyzer breaks input on whitespace only, so characters such
as - _ + stay inside the tokens.
Otis
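To illustrate what splitting on whitespace only means for the terms in
this thread, here is a minimal sketch in plain Java. It does not use
Lucene at all; the class name WhitespaceSplitSketch and the method
tokenize are hypothetical and only mimic the behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a whitespace-only tokenizer (not Lucene code).
public class WhitespaceSplitSketch {

    // Splits text on whitespace only; every other character stays in the token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // A whitespace character closes the token collected so far.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // Flush the last token if the text did not end in whitespace.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [the, alpha-androstane-3, sample,, my_field]
        System.out.println(tokenize("the alpha-androstane-3 sample, my_field"));
    }
}
```

Note that punctuation such as the trailing comma in "sample," also stays
attached to the token, and nothing is lowercased, so this approach is
only a good fit if that is acceptable for the data being indexed.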
--- Rupinder Singh Mazara <rs...@ebi.ac.uk> wrote:
> Hi
> thanks for the reply
> >> my dataset also seems to have a similar problem: the chemical name
> >> alpha-androstane-3, and several others, exist in the given text. Can
> >> anyone point out the best strategy to employ so that words containing
> >> - _ + are indexed as they are and are not mutilated?
> >
> >You have to use or write an Analyzer that doesn't tokenize on
> >non-letter or other characters.
>
> Are there any built-in analyzers that do that?
>
> >> currently on my indexes the StandardAnalyzer and QueryParser break up
> >> alpha-androstane-3
> >> into TEXT:alpha -TEXT:androstane -TEXT:3 , where TEXT is the Field to
> >> be searched
> >
> >Hm, I thought we've fixed QueryParser not to do this. Are you using
> >Lucene 1.4?
> No, I guess I will have to upgrade.
>
> Rupinder
RE: Allow non letter characters in tokens
Posted by Rupinder Singh Mazara <rs...@ebi.ac.uk>.
Hi
thanks for the reply
>> my dataset also seems to have a similar problem: the chemical name
>> alpha-androstane-3, and several others, exist in the given text. Can
>> anyone point out the best strategy to employ so that words containing
>> - _ + are indexed as they are and are not mutilated?
>
>You have to use or write an Analyzer that doesn't tokenize on
>non-letter or other characters.
Are there any built-in analyzers that do that?
>> currently on my indexes the StandardAnalyzer and QueryParser break up
>> alpha-androstane-3
>> into TEXT:alpha -TEXT:androstane -TEXT:3 , where TEXT is the Field to
>> be searched
>
>Hm, I thought we've fixed QueryParser not to do this. Are you using
>Lucene 1.4?
No, I guess I will have to upgrade.
Rupinder
RE: Allow non letter characters in tokens
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,
> my dataset also seems to have a similar problem: the chemical name
> alpha-androstane-3, and several others, exist in the given text. Can
> anyone point out the best strategy to employ so that words containing
> - _ + are indexed as they are and are not mutilated?
You have to use or write an Analyzer that doesn't tokenize on
non-letter or other characters.
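As a rough illustration of "an Analyzer that doesn't tokenize on
non-letter characters", the sketch below shows the underlying idea in
plain Java: a token is any run of characters accepted by a predicate.
Everything here is hypothetical (the class KeepCharsTokenizerSketch, the
EXTRA_TOKEN_CHARS set, the tokenize method) and none of it is Lucene
code. As far as I know, Lucene's CharTokenizer offers a comparable
isTokenChar hook to override, but please check that against the version
you are running.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical predicate-driven tokenizer (not Lucene code).
public class KeepCharsTokenizerSketch {

    // Assumption: these are the extra characters that must survive in tokens.
    static final String EXTRA_TOKEN_CHARS = "-_+";

    // A character belongs to a token if it is a letter, a digit,
    // or one of the explicitly allowed extra characters.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || EXTRA_TOKEN_CHARS.indexOf(c) >= 0;
    }

    // Collects maximal runs of token characters; anything else is a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        // Flush the last token if the text ended on a token character.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [alpha-androstane-3, my_field, C+]
        System.out.println(tokenize("alpha-androstane-3, my_field (C+)"));
    }
}
```

Unlike a whitespace-only split, this drops commas and parentheses while
keeping - _ + inside the token. Whatever rule is chosen, the same
analysis should be applied at query time, otherwise the indexed terms
and the query terms will not line up.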
> currently on my indexes the StandardAnalyzer and QueryParser break up
> alpha-androstane-3
> into TEXT:alpha -TEXT:androstane -TEXT:3 , where TEXT is the Field to
> be searched
Hm, I thought we've fixed QueryParser not to do this. Are you using
Lucene 1.4?
Otis
> If I enclose alpha-androstane-3 as a phrase, "alpha-androstane-3",
> then the QueryParser breaks it down to ABSTRACT:"alpha androstane-3";
> somehow the first "-" disappears?
>
> regards
>
> Rupinder
RE: Allow non letter characters in tokens
Posted by Rupinder Singh Mazara <rs...@ebi.ac.uk>.
Hi all
My dataset also seems to have a similar problem: the chemical name
alpha-androstane-3, and several others, exist in the given text. Can
anyone point out the best strategy to employ so that words containing
- _ + are indexed as they are and are not mutilated?
Currently on my indexes the StandardAnalyzer and QueryParser break up
alpha-androstane-3
into TEXT:alpha -TEXT:androstane -TEXT:3 , where TEXT is the Field to be
searched.
If I enclose alpha-androstane-3 as a phrase, "alpha-androstane-3", then the
QueryParser breaks it down to ABSTRACT:"alpha androstane-3"; somehow the
first "-" disappears?
regards
Rupinder