Posted to java-user@lucene.apache.org by Philip Puffinburger <pp...@tlcdelivers.com> on 2009/02/17 01:19:16 UTC

2.3.2 -> 2.4.0 StandardTokenizer issue

We have our own Analyzer, which has the following:

 

public final TokenStream tokenStream(String fieldname, Reader reader) {
  TokenStream result = new StandardTokenizer(reader);
  result = new StandardFilter(result);
  result = new MyAccentFilter(result);
  result = new LowerCaseFilter(result);
  result = new StopFilter(result);

  return result;
}

 

In 2.3.2, if the token ‘Cómo’ came through this it would get changed to
‘como’ by the time it made it through the filters.  In 2.4.0 this isn’t
the case: it treats this one token as two, so we get ‘co’ and ‘mo’.  So
instead of searching ‘como’ or ‘Cómo’ to get all the hits, we now have to do
them separately.
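For reference, here is a minimal sketch that prints the tokens so the split is
visible (it assumes the Lucene 2.x TokenStream API, and uses MyAnalyzer as a
stand-in name for our analyzer class):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TokenDump {
  public static void main(String[] args) throws Exception {
    // "Co\u0301mo" is the decomposed form: C o <combining acute accent> m o
    TokenStream ts = new MyAnalyzer().tokenStream("field",
        new StringReader("Co\u0301mo"));
    for (Token t = ts.next(); t != null; t = ts.next()) {
      // prints "co" and "mo" under 2.4.0, "como" under 2.3.2
      System.out.println(new String(t.termBuffer(), 0, t.termLength()));
    }
  }
}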

 

I switched to the WhitespaceTokenizer as a test, and it indexes and
searches the way we expect, but I haven’t looked into what we lost by
using that tokenizer.

 

Were we relying on a bug to get what we wanted from StandardTokenizer or did
something break in 2.4.0?


RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

Posted by Philip Puffinburger <pp...@tlcdelivers.com>.
Actually, WhitespaceTokenizer won't work.  Too many person names, and it
doesn't do anything with punctuation.  Something had to have changed in
StandardTokenizer, and we need some of the 2.4 fixes/features, so we are
kind of stuck.


-----Original Message-----
From: Philip Puffinburger [mailto:ppuffinburger@tlcdelivers.com] 
Sent: Monday, February 16, 2009 7:19 PM
To: java-user@lucene.apache.org
Subject: 2.3.2 -> 2.4.0 StandardTokenizer issue

We have our own Analyzer, which has the following:

 

public final TokenStream tokenStream(String fieldname, Reader reader) {
  TokenStream result = new StandardTokenizer(reader);
  result = new StandardFilter(result);
  result = new MyAccentFilter(result);
  result = new LowerCaseFilter(result);
  result = new StopFilter(result);

  return result;
}

 

In 2.3.2, if the token ‘Cómo’ came through this it would get changed to
‘como’ by the time it made it through the filters.  In 2.4.0 this isn’t
the case: it treats this one token as two, so we get ‘co’ and ‘mo’.  So
instead of searching ‘como’ or ‘Cómo’ to get all the hits, we now have to do
them separately.

 

I switched to the WhitespaceTokenizer as a test, and it indexes and
searches the way we expect, but I haven’t looked into what we lost by
using that tokenizer.

 

Were we relying on a bug to get what we wanted from StandardTokenizer or did
something break in 2.4.0?





RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

Posted by Philip Puffinburger <pp...@tlcdelivers.com>.
Thanks for the suggestion.  We're going to go over all of this information and these suggestions next week to see what we want to do.

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Saturday, February 21, 2009 11:52 AM
To: java-user@lucene.apache.org
Subject: Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

that was just a suggestion as a quick hack...

it still won't really fix the problem because some character + accent
combinations don't have composed forms.

even if you added the entire Combining Diacritical Marks block to the jflex
grammar, it's still wrong... what needs to be supported is the \p{Word_Break =
Extend} property, etc.




Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

Posted by Robert Muir <rc...@gmail.com>.
that was just a suggestion as a quick hack...

it still won't really fix the problem because some character + accent
combinations don't have composed forms.

even if you added the entire Combining Diacritical Marks block to the jflex
grammar, it's still wrong... what needs to be supported is the \p{Word_Break =
Extend} property, etc.
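for example, 'q' + U+0303 (combining tilde) has no precomposed code point, so
NFC leaves it decomposed and the tokenizer would still split it. a quick
illustration with Java 6's java.text.Normalizer, nothing lucene-specific:

import java.text.Normalizer;

public class NoComposedForm {
  public static void main(String[] args) {
    String s = "q\u0303"; // q + COMBINING TILDE; no precomposed form exists
    String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
    System.out.println(nfc.equals(s)); // true: NFC cannot compose it
    System.out.println(nfc.length());  // 2: still base letter + combining mark
  }
}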


On Sat, Feb 21, 2009 at 11:23 AM, Philip Puffinburger <
ppuffinburger@tlcdelivers.com> wrote:

> That's something we can try.  I don't know how much performance we'd
> lose doing that, as our custom filter has to decompose the tokens to do its
> operations.  So instead of 0..1 conversions we'd be doing 1..2 conversions
> during indexing and searching.
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Saturday, February 21, 2009 8:35 AM
> To: java-user@lucene.apache.org
> Subject: Re: 2.3.2 -> 2.4.0 StandardTokenizer issue
>
> normalize your text to NFC. then it will be \u0043 \u00F3 \u006D \u006F and
> will work...
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

Posted by Philip Puffinburger <pp...@tlcdelivers.com>.
That's something we can try.  I don't know how much performance we'd lose doing that, as our custom filter has to decompose the tokens to do its operations.  So instead of 0..1 conversions we'd be doing 1..2 conversions during indexing and searching.
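If we do try it, something like the sketch below is what I have in mind. It is
only a rough sketch: it assumes Java 6's java.text.Normalizer is available,
that reading the whole field value into memory is acceptable, and the
readFully helper and stop-word set are placeholders.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.text.Normalizer;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public final class NfcNormalizingAnalyzer extends Analyzer {
  private final Set stopWords; // built elsewhere

  public NfcNormalizingAnalyzer(Set stopWords) {
    this.stopWords = stopWords;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    // compose to NFC once, up front, so the tokenizer never sees bare combining marks
    String text = Normalizer.normalize(readFully(reader), Normalizer.Form.NFC);
    TokenStream result = new StandardTokenizer(new StringReader(text));
    result = new StandardFilter(result);
    // result = new MyAccentFilter(result); // our accent filter would go here and decompose again internally
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopWords);
    return result;
  }

  private static String readFully(Reader reader) {
    try {
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[1024];
      for (int n = reader.read(buf); n != -1; n = reader.read(buf)) {
        sb.append(buf, 0, n);
      }
      return sb.toString();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}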

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Saturday, February 21, 2009 8:35 AM
To: java-user@lucene.apache.org
Subject: Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

normalize your text to NFC. then it will be \u0043 \u00F3 \u006D \u006F and
will work...




Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

Posted by Robert Muir <rc...@gmail.com>.
normalize your text to NFC. then it will be \u0043 \u00F3 \u006D \u006F and
will work...
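a quick way to check it (plain Java 6 java.text.Normalizer, just for
illustration):

import java.text.Normalizer;

public class NfcCheck {
  public static void main(String[] args) {
    String decomposed = "Co\u0301mo"; // C o <combining acute accent> m o
    String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
    for (int i = 0; i < nfc.length(); i++) {
      // prints \u0043 \u00F3 \u006D \u006F
      System.out.printf("\\u%04X ", (int) nfc.charAt(i));
    }
  }
}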

On Fri, Feb 20, 2009 at 11:16 PM, Philip Puffinburger <
ppuffinburger@tlcdelivers.com> wrote:

> >some changes were made to the StandardTokenizer.jflex grammar (you can svn
> >diff the two URLs fairly trivially) to better deal with correctly
> >identifying word characters, but from what i can tell that should have
> >reduced the number of splits, not increased them.
> >
> >it's hard to tell from your email (because it was sent in the windows-1252
> >charset) but what exactly are the unicode characters you are putting
> >through the tokenizer (ie: "\u0030") ?  knowing where it's splitting would
> >help figure out what's happening.
>
> These are the characters that are going through:
>
> \u0043 \u006F \u0301 \u006D \u006F - C o <Combining Acute Accent> m o
>
> It's splitting at the \u0301.
>
> >worst case scenario, you could probably use the StandardTokenizer from
> >2.3.2 with the rest of the 2.4 code.
>
> We've thought of that, but that would be the last thing we'd do to get it
> back to working.
>
> >this will show you exactly what changed...
> >svn diff
> >http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
> >http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>
> Thanks for the links.   I've never dealt with JFlex, so I'll have to do
> some reading to know what is going on in those files.
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

Posted by Philip Puffinburger <pp...@tlcdelivers.com>.
>some changes were made to the StandardTokenizer.jflex grammar (you can svn
>diff the two URLs fairly trivially) to better deal with correctly
>identifying word characters, but from what i can tell that should have
>reduced the number of splits, not increased them.
>
>it's hard to tell from your email (because it was sent in the windows-1252
>charset) but what exactly are the unicode characters you are putting
>through the tokenizer (ie: "\u0030") ?  knowing where it's splitting would
>help figure out what's happening.

These are the characters that are going through:

\u0043 \u006F \u0301 \u006D \u006F - C o <Combining Acute Accent> m o

It's splitting at the \u0301.
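(U+0301 is a non-spacing combining mark, category Mn, which I assume is why
the 2.4.0 grammar ends the token right before it. A quick check in plain
Java:)

public class CombiningMarkCheck {
  public static void main(String[] args) {
    // U+0301 COMBINING ACUTE ACCENT has general category Mn (non-spacing mark)
    System.out.println(Character.getType(0x0301) == Character.NON_SPACING_MARK); // true
  }
}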

>worst case scenario, you could probably use the StandardTokenizer from
>2.3.2 with the rest of the 2.4 code.

We've thought of that, but that would be the last thing we'd do to get it back to working.

>this will show you exactly what changed...
>svn diff
>http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

Thanks for the links.   I've never dealt with JFlex, so I'll have to do some reading to know what is going on in those files.




Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

Posted by Chris Hostetter <ho...@fucit.org>.
: In 2.3.2, if the token ‘Cómo’ came through this it would get changed to
: ‘como’ by the time it made it through the filters.  In 2.4.0 this isn’t
: the case: it treats this one token as two, so we get ‘co’ and ‘mo’.  So
: instead of searching ‘como’ or ‘Cómo’ to get all the hits, we now have to do
: them separately.

some changes were made to the StandardTokenizer.jflex grammar (you can svn 
diff the two URLs fairly trivially) to better deal with correctly 
identifying word characters, but from what i can tell that should have 
reduced the number of splits, not increased them.

it's hard to tell from your email (because it was sent in the windows-1252 
charset) but what exactly are the unicode characters you are putting 
through the tokenizer (ie: "\u0030") ?  knowing where it's splitting would 
help figure out what's happening.

worst case scenario, you could probably use the StandardTokenizer from 
2.3.2 with the rest of the 2.4 code.

this will show you exactly what changed...
svn diff http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex



-Hoss