Posted to java-user@lucene.apache.org by Anna Hunecke <an...@yahoo.de> on 2010/06/17 15:31:34 UTC

Strange behaviour of StandardTokenizer

Hi!

I ran into a strange behaviour of the StandardTokenizer. Terms containing a '-' are tokenized differently depending on the context. 
For example, the term 'nl-lt' is split into 'nl' and 'lt'.
The term 'nl-lt0' is tokenized into 'nl-lt0'.
Is this a bug or a feature? Can I avoid it somehow?
I'm using Lucene 3.0.0.

Best,
Anna



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Strange behaviour of StandardTokenizer

Posted by Anna Hunecke <an...@yahoo.de>.
Hi!

Basically, what I want is something that removes punctuation.
But I realize now that things like email or number recognition are also very useful if I want to give suggestions. I want to be able to offer 'nl-lt001' as a suggestion when the user enters 'nl'. This would of course not be possible if the tokenizer just blindly split at the '-'.
So, I'll stick with the tokenizer for now and fix the problems I had with the splitting of words by building the queries differently.
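Building the queries differently along these lines can be sketched outside Lucene, just to illustrate prefix suggestions over unsplit tokens (the class and method names here are hypothetical, not part of any Lucene API):

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixSuggest {
    // Hypothetical sketch: prefix suggestions over unsplit tokens.
    // If the tokenizer split 'nl-lt001' at '-', the lookup below could
    // only ever return 'nl', never the full term.
    final SortedSet<String> terms = new TreeSet<String>();

    SortedSet<String> suggest(String prefix) {
        // All terms in [prefix, prefix + '\uffff') start with the prefix.
        return terms.subSet(prefix, prefix + '\uffff');
    }

    public static void main(String[] args) {
        PrefixSuggest s = new PrefixSuggest();
        s.terms.add("nl-lt001");
        s.terms.add("nl-lt002");
        s.terms.add("netherlands");
        System.out.println(s.suggest("nl")); // [nl-lt001, nl-lt002]
    }
}
```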
Thanks for your help!

- Anna






Re: Strange behaviour of StandardTokenizer

Posted by Simon Willnauer <si...@googlemail.com>.
Hi Anna,

what are you using your tokenizer for? There are a lot of different
options in Lucene, and StandardTokenizer is not necessarily the best
one. The behaviour you see comes from the tokenizer detecting your token
as a number. When you look at the grammar, that is kind of obvious.

<snip>
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
           | {HAS_DIGIT} {P} {ALPHANUM}
           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)

// punctuation
P	         = ("_"|"-"|"/"|"."|",")

</snip>
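To see why 'nl-lt0' survives as one token while 'nl-lt' does not, the first two NUM alternatives can be mimicked with plain java.util.regex. This is only a rough sketch: the real JFlex grammar has more alternatives and broader character classes than the ASCII stand-ins used here.

```java
import java.util.regex.Pattern;

public class NumRuleDemo {
    // Simplified ASCII stand-ins for the JFlex macros:
    // ALPHANUM = letters/digits; HAS_DIGIT = letters/digits with at least one digit.
    static final String ALPHANUM  = "[a-zA-Z0-9]+";
    static final String HAS_DIGIT = "[a-zA-Z]*[0-9][a-zA-Z0-9]*";
    static final String P         = "[_\\-/.,]";

    // Only the first two NUM alternatives, which are enough for
    // two-segment tokens: every other segment must have a digit.
    static final Pattern NUM = Pattern.compile(
            ALPHANUM + P + HAS_DIGIT + "|" + HAS_DIGIT + P + ALPHANUM);

    static boolean matchesNum(String token) {
        return NUM.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesNum("nl-lt0")); // true  -> kept as one NUM token
        System.out.println(matchesNum("nl-lt"));  // false -> falls through, split at '-'
    }
}
```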

you can either build your own custom filter which fixes only the
problem with numbers containing a '-', use the MappingCharFilter, or
switch to a different tokenizer.
If you could say more about your use case you might get better suggestions.

Simon



Re: Strange behaviour of StandardTokenizer

Posted by Ahmet Arslan <io...@yahoo.com>.
> okay, so it is recognized as a number? 

Yes. You can see the token type definitions in the *.jflex file.

> Maybe I'll have to use another tokenizer.

You can keep StandardTokenizer by putting a MappingCharFilter in front of it:

// e.g. with Lucene 3.0:
// import java.io.StringReader;
// import org.apache.lucene.analysis.MappingCharFilter;
// import org.apache.lucene.analysis.NormalizeCharMap;
// import org.apache.lucene.analysis.TokenStream;
// import org.apache.lucene.analysis.standard.StandardTokenizer;
// import org.apache.lucene.util.Version;
NormalizeCharMap map = new NormalizeCharMap();
map.add("-", " ");  // rewrite every '-' to a space before tokenization

TokenStream stream = new StandardTokenizer(Version.LUCENE_30,
        new MappingCharFilter(map,
        new StringReader("nl-lt0")));  // now tokenized as 'nl', 'lt0'




      



Re: Strange behaviour of StandardTokenizer

Posted by Anna Hunecke <an...@yahoo.de>.
Hi Ahmet,
thanks for the explanation. :)
Okay, so it is recognized as a number? I really didn't expect that. I expected that all words would either be split at the minus or not.
Maybe I'll have to use another tokenizer.
Best,
Anna






Re: Strange behaviour of StandardTokenizer

Posted by Ahmet Arslan <io...@yahoo.com>.
> I ran into a strange behaviour of the StandardTokenizer.
> Terms containing a '-' are tokenized differently depending
> on the context. 
> For example, the term 'nl-lt' is split into 'nl' and 'lt'.
> The term 'nl-lt0' is tokenized into 'nl-lt0'.
> Is this a bug or a feature? 

It is designed that way. The TypeAttribute of those tokens is different.

> Can I avoid it somehow?

Do you want to split at '-' char no matter what? If yes, you can replace all '-' characters with whitespace using MappingCharFilter before StandardTokenizer. 


      
