Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2011/06/15 01:07:14 UTC

International filters/tokenizers doing too much

Because the text in my index comes in many different languages with no 
ability to know the language ahead of time, I have a need to use 
ICUTokenizer and/or the CJK filters, but I have a problem with them as 
they are implemented currently.  They do extra things like handle email 
addresses, tokenize on non-alphanumeric characters, etc.  I need them to 
not do these things.  This is my current index analyzer chain:

http://pastebin.com/dNBGmeeW

My current idea for how to change this is to use the ICUTokenizer 
instead of the WhitespaceTokenizer, then as one of the later steps, run 
it through CJK so that it outputs bigrams for the CJK characters.  The 
reason I can't do this now is that I must let WordDelimiterFilter handle 
punctuation, case changes, and numbers, because of the magic of the 
preserveOriginal flag.

Is it possible to turn off these extra features in these analyzer 
components as they are written now?  If not, is it a painful process for 
someone with Java experience to customize the code so it IS possible?  I 
have not yet looked at the code, but I will do so in the next couple of 
days.  Ideally, I would also like to have a WordDelimiterFilter that is 
fully aware of international capitalization via ICU.  Does any such 
thing exist?

In the current chain, you'll notice a pattern filter.  What this does is 
remove leading and trailing punctuation from tokens.  Punctuation inside 
the token is preserved, for later handling with WordDelimiterFilter.
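
For illustration only, here is the kind of trimming such a pattern filter
might do, sketched as standalone Java (the regex is a guess at the general
shape, not the actual expression from the pastebin config):

// Hypothetical sketch: strip leading/trailing punctuation from a token but
// keep punctuation inside it, so WordDelimiterFilter can deal with the
// interior punctuation later.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrimPunctuationDemo {
  // Group 1 is the token "core"; punctuation before and after it is dropped.
  private static final Pattern TRIM =
      Pattern.compile("^\\p{Punct}*(.*?)\\p{Punct}*$");

  static String trim(String token) {
    Matcher m = TRIM.matcher(token);
    return m.matches() ? m.group(1) : token;
  }

  public static void main(String[] args) {
    System.out.println(trim("\"wi-fi,\""));  // wi-fi  (interior hyphen kept)
    System.out.println(trim("(hello)."));    // hello
  }
}

In the real chain this sits in a pattern-based filter factory; the point is
only that the regex anchors at both ends and leaves interior punctuation
alone.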

Thanks,
Shawn


Re: International filters/tokenizers doing too much

Posted by Shawn Heisey <so...@elyograg.org>.
On 6/14/2011 5:34 PM, Robert Muir wrote:
> On Tue, Jun 14, 2011 at 7:07 PM, Shawn Heisey <so...@elyograg.org> wrote:
>> Because the text in my index comes in many different languages with no
>> ability to know the language ahead of time, I have a need to use
>> ICUTokenizer and/or the CJK filters, but I have a problem with them as they
>> are implemented currently.  They do extra things like handle email
>> addresses, tokenize on non-alphanumeric characters, etc.  I need them to not
>> do these things.  This is my current index analyzer chain:
> The idea is that you customize it to whatever your app needs by
> passing an ICUTokenizerConfig to the Tokenizer:
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig.java
>
> The default implementation (DefaultICUTokenizerConfig) is pretty
> minimal: mostly the Unicode default word break implementation,
> described here: http://unicode.org/reports/tr29/
>
> As you see, you just need to provide a BreakIterator given the script
> code; you could implement this by hand in Java code, or it could use a
> dictionary, or whatever.
>
> But the easiest and usually most performant approach is just to use
> rules, especially since they are compiled to an efficient form for
> processing.  The syntax is described here:
> http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules
>
> You compile them into a state machine with this:
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/RBBIRuleCompiler.java
> and you can load the serialized form (statically, or in your factory,
> or whatever) with:
> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#getInstanceFromCompiledRules%28java.io.InputStream%29
>
> The reason the script code is provided is that if you are
> customizing, it's pretty easy to screw some languages over with rules
> that might happen to work well for another set of languages.  This
> way you can provide different rules depending upon the writing system.
>
> For example, you could return special punctuation rules for Western
> languages when it's the Latin script, but still return the default
> implementation for Tibetan or something you might be less familiar
> with (maybe you actually speak Tibetan; this was just an example).

My understanding starts to break down horribly with things like this.  I 
can make sense out of very simple Java code, but I can't make sense out 
of this, and don't know how to take these bits of information you've 
given me and do something useful with them.  I will take the information 
to our programming team before I bug you about it again.  They will 
probably have some idea what to do.  I'm hoping that I can just create 
an extra .jar and not touch the existing lucene/solr code.

Beyond the ICU stuff, what kind of options do I have for dealing with 
other character sets (CJK, Arabic, Cyrillic, etc.) in some sane manner 
while not touching typical Latin punctuation?  I notice that for CJK 
there is only a Tokenizer and an Analyzer; what I really need is a token 
filter that ONLY deals with the CJK characters.  Is that going to be a 
major undertaking that is best handled by an experienced Lucene 
developer?  Would such a thing be required for Arabic and Cyrillic, or 
are they pretty well covered by whitespace and WDF?

Thanks,
Shawn


Re: International filters/tokenizers doing too much

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Jun 14, 2011 at 7:07 PM, Shawn Heisey <so...@elyograg.org> wrote:
> Because the text in my index comes in many different languages with no
> ability to know the language ahead of time, I have a need to use
> ICUTokenizer and/or the CJK filters, but I have a problem with them as they
> are implemented currently.  They do extra things like handle email
> addresses, tokenize on non-alphanumeric characters, etc.  I need them to not
> do these things.  This is my current index analyzer chain:

The idea is that you customize it to whatever your app needs by
passing an ICUTokenizerConfig to the Tokenizer:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig.java

The default implementation (DefaultICUTokenizerConfig) is pretty
minimal: mostly the Unicode default word break implementation,
described here: http://unicode.org/reports/tr29/

As you see, you just need to provide a BreakIterator given the script
code; you could implement this by hand in Java code, or it could use a
dictionary, or whatever.

But the easiest and usually most performant approach is just to use
rules, especially since they are compiled to an efficient form for
processing.  The syntax is described here:
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules

You compile them into a state machine with this:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/RBBIRuleCompiler.java
and you can load the serialized form (statically, or in your factory,
or whatever) with:
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#getInstanceFromCompiledRules%28java.io.InputStream%29
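
Just to make the loading side concrete, a minimal sketch of reading a
pre-compiled rule file back in (the file name "Latin.brk" and its
classpath location are made up for this example):

// Hedged sketch: read rules that were compiled ahead of time with Lucene's
// RBBIRuleCompiler and shipped on the classpath as "Latin.brk" (both the
// file name and its location are assumptions for illustration).
import java.io.IOException;
import java.io.InputStream;

import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;

public final class CompiledRulesLoader {
  private CompiledRulesLoader() {}

  public static BreakIterator loadLatinRules() throws IOException {
    InputStream in = CompiledRulesLoader.class.getResourceAsStream("Latin.brk");
    if (in == null) {
      throw new IOException("Latin.brk not found on classpath");
    }
    try {
      // Deserializes the compiled state machine; much cheaper than parsing
      // the rule text every time an analyzer is created.
      return RuleBasedBreakIterator.getInstanceFromCompiledRules(in);
    } finally {
      in.close();
    }
  }
}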

The reason the script code is provided is that if you are customizing,
it's pretty easy to screw some languages over with rules that might
happen to work well for another set of languages.  This way you can
provide different rules depending upon the writing system.

For example, you could return special punctuation rules for Western
languages when it's the Latin script, but still return the default
implementation for Tibetan or something you might be less familiar
with (maybe you actually speak Tibetan; this was just an example).
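
Pulling that together with the script-code point, a rough, untested sketch
of a per-script config (the abstract method names are taken from the
ICUTokenizerConfig source linked above; everything else is an assumption):

// Sketch only: special rules for Latin-script text, default UAX#29 behaviour
// for every other script, so writing systems we know less about are left alone.
import com.ibm.icu.lang.UScript;
import com.ibm.icu.text.BreakIterator;

import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig;

public class PerScriptTokenizerConfig extends ICUTokenizerConfig {
  // Delegate for every script we do not want to customize.
  private final ICUTokenizerConfig defaults = new DefaultICUTokenizerConfig();
  // e.g. built from compiled RBBI rules as in the earlier loading sketch.
  private final BreakIterator latinRules;

  public PerScriptTokenizerConfig(BreakIterator latinRules) {
    this.latinRules = latinRules;
  }

  @Override
  public BreakIterator getBreakIterator(int script) {
    if (script == UScript.LATIN) {
      // BreakIterators carry per-use state, so hand out a copy each time.
      return (BreakIterator) latinRules.clone();
    }
    return defaults.getBreakIterator(script);
  }

  @Override
  public String getType(int script, int ruleStatus) {
    // Keep the default mapping from rule status to token type.
    return defaults.getType(script, ruleStatus);
  }
}

An instance of this would then be handed to the ICUTokenizer (or to a small
custom tokenizer factory, on the Solr side) in place of the default config.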