Posted to java-user@lucene.apache.org by Marcus Rau <ma...@meta-level.de> on 2004/07/29 12:47:36 UTC

Allow non letter characters in tokens

Hi there,

my question is a pretty short one!

How can I prevent Lucene from cutting out special characters (e.g. the
"_") during tokenization of a text? It's quite essential for me to have
some non-letter chars in my index.

Regards
Marcus


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: Allow non letter characters in tokens

Posted by Otis Gospodnetic <ot...@yahoo.com>.
WhitespaceAnalyzer breaks input on spaces.

Otis
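
To illustrate why splitting on whitespace alone keeps names such as
alpha-androstane-3 intact, here is a minimal sketch in plain Java. It
does not use Lucene; the class and method names are hypothetical and
only mimic the splitting rule described above.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSplitSketch {
    // Illustration only (not Lucene code): split on runs of whitespace,
    // so '-', '_' and '+' remain inside the resulting tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [alpha-androstane-3, and, foo_bar, a+b]
        System.out.println(tokenize("alpha-androstane-3 and foo_bar a+b"));
    }
}
```

The trade-off is that nothing else is stripped either: punctuation
glued to a word (a trailing comma, say) stays part of the token, and
case is left untouched.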

--- Rupinder Singh Mazara <rs...@ebi.ac.uk> wrote:

> Are there any built in analyzers that do that ?


RE: Allow non letter characters in tokens

Posted by Rupinder Singh Mazara <rs...@ebi.ac.uk>.
Hi,

thanks for the reply.

>> My dataset also seems to have a similar problem: the chemical name
>> alpha-androstane-3, and several others, exist in the given text. Can
>> anyone point out the best strategy to employ so that words containing
>> - _ + are indexed as they are and not mutilated?
>
>You have to use or write an Analyzer that doesn't tokenize on
>non-letter or other characters.

Are there any built-in analyzers that do that?

>> Currently on my indexes the StandardAnalyzer and QueryParser break up
>> alpha-androstane-3 into TEXT:alpha -TEXT:androstane -TEXT:3, where
>> TEXT is the Field to be searched.
>
>Hm, I thought we've fixed QueryParser not to do this.  Are you using
>Lucene 1.4?

No, I guess I will have to.

Rupinder




RE: Allow non letter characters in tokens

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

> My dataset also seems to have a similar problem: the chemical name
> alpha-androstane-3, and several others, exist in the given text. Can
> anyone point out the best strategy to employ so that words containing
> - _ + are indexed as they are and not mutilated?

You have to use or write an Analyzer that doesn't tokenize on
non-letter or other characters.

> Currently on my indexes the StandardAnalyzer and QueryParser break up
> alpha-androstane-3 into TEXT:alpha -TEXT:androstane -TEXT:3, where
> TEXT is the Field to be searched.

Hm, I thought we'd fixed QueryParser not to do this.  Are you using
Lucene 1.4?

Otis
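
As a rough sketch of what "an Analyzer that doesn't tokenize on
non-letter characters" boils down to, the plain-Java example below
treats letters, digits and the characters - _ + as part of a token and
splits on everything else. It is not Lucene code: the class name, the
character set and the lowercasing step are all assumptions made for
illustration. Lucene's CharTokenizer exposes an isTokenChar hook where
a rule like this could live, with the tokenizer then wrapped in a
custom Analyzer; that wiring is left out here.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenCharSketch {
    // Hypothetical rule: letters, digits and - _ + belong to a token.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '_' || c == '+';
    }

    // Collect maximal runs of token characters, lowercasing as we go.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-token character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [alpha-androstane-3, foo_bar, a+b]
        System.out.println(tokenize("Alpha-androstane-3, foo_bar (a+b)."));
    }
}
```

Whatever rule is chosen, the same analysis should be applied at query
time as well, otherwise the indexed terms and the query terms will not
match.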



RE: Allow non letter characters in tokens

Posted by Rupinder Singh Mazara <rs...@ebi.ac.uk>.
Hi all,

My dataset also seems to have a similar problem: the chemical name
alpha-androstane-3, and several others, exist in the given text. Can
anyone point out the best strategy to employ so that words containing
- _ + are indexed as they are and not mutilated?

Currently on my indexes the StandardAnalyzer and QueryParser break up
alpha-androstane-3 into TEXT:alpha -TEXT:androstane -TEXT:3, where TEXT
is the Field to be searched.

If I enclose alpha-androstane-3 as a phrase, "alpha-androstane-3", then
the QueryParser breaks it down to ABSTRACT:"alpha androstane-3";
somehow the first "-" disappears?


 regards

 Rupinder
