Posted to java-user@lucene.apache.org by Trejkaz <tr...@trypticon.org> on 2011/01/17 01:37:12 UTC

Unicode normalisation *before* tokenisation?

Hi all.

I discovered there is now a normalisation filter which uses ICU's
Normalizer2 (org.apache.lucene.analysis.icu.ICUNormalizer2Filter).
However, because it is a token filter it only runs after tokenisation,
so various problems can result when it is used with StandardTokenizer.

One in particular is half-width Katakana.

Supposing you start out with the following (four Java chars):

    ﾊﾟﾊﾟ

StandardTokenizer will break this up into four separate tokens of one
char each.  (I can't show an example as the combining handakuten
doesn't actually render by itself on my system.)

Passing this through the normalising filter converts it back to normal
full-width Katakana, but unfortunately still generates four separate
tokens.
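
For concreteness, this is roughly the sort of chain I mean -- a minimal
sketch, assuming the 3.1 jars (where the ICU filter lives), a pre-3.1
Version constant to get the tokenisation behaviour described above, and
that the sample is the half-width pair U+FF8A U+FF9F twice; the class
name HalfWidthDemo is just for illustration:

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.Version;

    public class HalfWidthDemo {
        public static void main(String[] args) throws Exception {
            // Half-width ﾊﾟﾊﾟ: U+FF8A U+FF9F U+FF8A U+FF9F (four chars).
            String text = "\uFF8A\uFF9F\uFF8A\uFF9F";

            // Tokenise first, normalise second -- the order a TokenFilter forces.
            TokenStream ts = new ICUNormalizer2Filter(
                new StandardTokenizer(Version.LUCENE_30, new StringReader(text)));

            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);

            ts.reset();
            while (ts.incrementToken()) {
                // With the old grammar this comes out as four one-char tokens,
                // each normalised in isolation rather than recombined into パパ.
                System.out.println(term + " [" + offset.startOffset() + ","
                    + offset.endOffset() + ")");
            }
            ts.end();
            ts.close();
        }
    }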

Thinking that I could avoid this problem by filtering the content
*before* the tokeniser got to it, I wrote a NormalisingReader and
passed the text through that first.  I figured this would also be
faster, as normalisation could be done in chunks of a few kilobytes
rather than one small token at a time.
Unfortunately, it doesn't give usable results, because the text
offsets the tokeniser reports are relative to the normalising reader,
not the original text.  This quickly caused issues when trying to
highlight the hits.
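
For reference, the reader was essentially along these lines -- a
simplified sketch using ICU4J's Normalizer2 directly, normalising the
whole input up-front rather than in chunks, but the offset problem is
the same either way:

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;

    import com.ibm.icu.text.Normalizer2;

    /** Sketch: hands the tokeniser NFKC-normalised text instead of the original. */
    public class NormalisingReader extends Reader {
        private final Reader normalised;

        public NormalisingReader(Reader in) throws IOException {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            // Offsets reported downstream now refer to this normalised string,
            // not to the original text -- which is what breaks highlighting.
            Normalizer2 nfkc =
                Normalizer2.getInstance(null, "nfkc", Normalizer2.Mode.COMPOSE);
            this.normalised = new StringReader(nfkc.normalize(sb));
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            return normalised.read(cbuf, off, len);
        }

        @Override
        public void close() throws IOException {
            normalised.close();
        }
    }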

There are alternative workarounds for the specific issue of half-width
Katakana, of course:
    1. Write a filter which joins tokens containing a dakuten or
handakuten with the previous token.
    2. Modify the tokeniser itself to make it output such pairs as
single tokens.

I am currently reluctant to do either of these, as there are other
issues with not normalising up-front.  For instance, if "½" appeared
in the text, it would be indexed as a single token "1/2", but someone
searching for it by typing "1/2" would not find it, because the query
"1/2" would be analysed as two tokens ("1" and "2").  Writing a
general filter to join together tokens which normalise with each other
(and split apart those which decompose into a sequence that won't
recompose into anything) seems like a significantly difficult task.

So I guess I have two questions:
    1. Is there some way to do filtering to the text before
tokenisation without upsetting the offsets reported by the tokeniser?
    2. Is there some more general solution to this problem, such as an
existing tokeniser similar to StandardTokeniser but with better
Unicode awareness?

TX



Re: Unicode normalisation *before* tokenisation?

Posted by Trejkaz <tr...@trypticon.org>.
On Mon, Jan 17, 2011 at 11:53 AM, Robert Muir <rc...@gmail.com> wrote:
> On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz <tr...@trypticon.org> wrote:
>> So I guess I have two questions:
>>    1. Is there some way to do filtering to the text before
>> tokenisation without upsetting the offsets reported by the tokeniser?
>>    2. Is there some more general solution to this problem, such as an
>> existing tokeniser similar to StandardTokeniser but with better
>> Unicode awareness?
>>
>
> Hi, I think you want to try the StandardTokenizer in 3.1 (make sure
> you pass Version.LUCENE_31 to get the new behavior).
> It implements the UAX#29 algorithm, which respects canonical
> equivalence... it sounds like that's what you want.

This does sound like what we want, although it might take some time
to verify whether UAX#29 breaks the text the way we want (the standard
itself doesn't give solid examples of how the algorithm behaves on
different kinds of text, which is a bit unfortunate).

The other problem is that we're still stuck on 2.9, because our
codebase still uses deprecated features and we have very little time
to do anything about it.  Moving to the new API is taking a while, as
some of those API changes are quite tricky to refactor for
(TokenStream in particular can make fixing a single class take half a
day, once you add the time to verify that it is working correctly).
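
For example, even a trivial filter has to move from the old
next(Token) style to incrementToken() plus attributes -- roughly like
this (a hypothetical filter, just to show the shape of the change,
using the 2.9 tokenattributes classes):

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    // Old 2.x style: override the (now deprecated) next(Token) method.
    class OldStyleFilter extends TokenFilter {
        OldStyleFilter(TokenStream input) { super(input); }

        public Token next(Token reusableToken) throws IOException {
            Token token = input.next(reusableToken);
            if (token == null) return null;
            // ... transform token.termBuffer()/token.termLength() here ...
            return token;
        }
    }

    // New style: incrementToken() plus per-stream attributes.
    class NewStyleFilter extends TokenFilter {
        private final TermAttribute termAtt = addAttribute(TermAttribute.class);

        NewStyleFilter(TokenStream input) { super(input); }

        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) return false;
            // ... transform termAtt.termBuffer()/termAtt.termLength() here ...
            return true;
        }
    }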

TX



Re: Unicode normalisation *before* tokenisation?

Posted by Robert Muir <rc...@gmail.com>.
On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz <tr...@trypticon.org> wrote:
> So I guess I have two questions:
>    1. Is there some way to do filtering to the text before
> tokenisation without upsetting the offsets reported by the tokeniser?
>    2. Is there some more general solution to this problem, such as an
> existing tokeniser similar to StandardTokeniser but with better
> Unicode awareness?
>

Hi, I think you want to try the StandardTokenizer in 3.1 (make sure
you pass Version.LUCENE_31 to get the new behavior).
It implements the UAX#29 algorithm, which respects canonical
equivalence... it sounds like that's what you want.

http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
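
Something like this (an untested sketch; the class name Uax29Chain is
just for illustration):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class Uax29Chain {
        public static TokenStream build(String text) {
            // LUCENE_31 selects the UAX#29-based grammar; older constants keep
            // the previous behaviour for indexes built with earlier releases.
            return new ICUNormalizer2Filter(
                new StandardTokenizer(Version.LUCENE_31, new StringReader(text)));
        }
    }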
