Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2011/06/15 01:07:14 UTC
International filters/tokenizers doing too much
Because the text in my index comes in many different languages with no
ability to know the language ahead of time, I have a need to use
ICUTokenizer and/or the CJK filters, but I have a problem with them as
they are implemented currently. They do extra things like handle email
addresses, tokenize on non-alphanumeric characters, etc. I need them to
not do these things. This is my current index analyzer chain:
http://pastebin.com/dNBGmeeW
My current idea for how to change this is to use the ICUTokenizer
instead of the WhitespaceTokenizer, then as one of the later steps, run
it through CJK so that it outputs bigrams for the CJK characters. The
reason I can't do this now is that I must let WordDelimiterFilter handle
punctuation, case changes, and numbers, because of the magic of the
preserveOriginal flag.
Is it possible to turn off these extra features in these analyzer
components as they are written now? If not, is it a painful process for
someone with Java experience to customize the code so it IS possible? I
have not yet looked at the code, but I will do so in the next couple of
days. Ideally, I would also like to have a WordDelimiterFilter that is
fully aware of international capitalization via ICU. Does any such
thing exist?
In the current chain, you'll notice a pattern filter. What this does is
remove leading and trailing punctuation from tokens. Punctuation inside
the token is preserved, for later handling with WordDelimiterFilter.
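In case the pastebin link dies, the step is something along these lines in schema.xml (regex approximate, from memory; the exact pattern in the link may differ):

```xml
<!-- strip leading/trailing punctuation from each token; inner
     punctuation is kept for WordDelimiterFilter to deal with later -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="^\p{Punct}+|\p{Punct}+$"
        replacement=""
        replace="all"/>
```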
Thanks,
Shawn
Re: International filters/tokenizers doing too much
Posted by Shawn Heisey <so...@elyograg.org>.
On 6/14/2011 5:34 PM, Robert Muir wrote:
> On Tue, Jun 14, 2011 at 7:07 PM, Shawn Heisey<so...@elyograg.org> wrote:
>> Because the text in my index comes in many different languages with no
>> ability to know the language ahead of time, I have a need to use
>> ICUTokenizer and/or the CJK filters, but I have a problem with them as they
>> are implemented currently. They do extra things like handle email
>> addresses, tokenize on non-alphanumeric characters, etc. I need them to not
>> do these things. This is my current index analyzer chain:
> the idea is that you customize it to whatever your app needs, by
> passing ICUTokenizerConfig to the Tokenizer:
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig.java
>
> the default implementation (DefaultICUTokenizerConfig) is pretty
> minimal: mostly the Unicode default word break implementation,
> described here: http://unicode.org/reports/tr29/
>
> as you can see, you just need to provide a BreakIterator given the
> script code; you could implement this by hand in Java code, it could
> use a dictionary, or whatever.
>
> But the easiest and usually most performant approach is just to use
> rules, especially since they are compiled to an efficient form for
> processing. The syntax is described here:
> http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules
>
> you compile them into a state machine with this:
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/RBBIRuleCompiler.java
> and you can load the serialized form (statically, or in your factory,
> or whatever) with
> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#getInstanceFromCompiledRules%28java.io.InputStream%29
>
> the reason the script code is provided is that, if you are
> customizing, it's pretty easy to screw some languages over with rules
> that might happen to work well for another set of languages. so this
> way you can provide different rules depending upon the writing system.
>
> for example you could return special punctuation rules for western
languages when it's the Latin script, but still return the default impl
> for Tibetan or something you might be less familiar with (maybe you
> actually speak Tibetan, this was just an example).
My understanding starts to break down horribly with things like this. I
can make sense out of very simple Java code, but I can't make sense out
of this, and don't know how to take these bits of information you've
given me and do something useful with them. I will take the information
to our programming team before I bug you about it again. They will
probably have some idea what to do. I'm hoping that I can just create
an extra .jar and not touch the existing lucene/solr code.
Beyond the ICU stuff, what kind of options do I have for dealing with
other character sets (CJK, Arabic, Cyrillic, etc.) in some sane manner
while not touching typical Latin punctuation? I notice that for CJK,
there is only a Tokenizer and an Analyzer; what I really need is a token
filter that ONLY deals with the CJK characters. Is that going to be a
major undertaking that is best handled by an experienced Lucene
developer? Would such a thing be required for Arabic and Cyrillic, or
are they pretty well covered by whitespace and WDF?
Thanks,
Shawn
Re: International filters/tokenizers doing too much
Posted by Robert Muir <rc...@gmail.com>.
On Tue, Jun 14, 2011 at 7:07 PM, Shawn Heisey <so...@elyograg.org> wrote:
> Because the text in my index comes in many different languages with no
> ability to know the language ahead of time, I have a need to use
> ICUTokenizer and/or the CJK filters, but I have a problem with them as they
> are implemented currently. They do extra things like handle email
> addresses, tokenize on non-alphanumeric characters, etc. I need them to not
> do these things. This is my current index analyzer chain:
the idea is that you customize it to whatever your app needs, by
passing ICUTokenizerConfig to the Tokenizer:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig.java
the default implementation (DefaultICUTokenizerConfig) is pretty
minimal: mostly the Unicode default word break implementation,
described here: http://unicode.org/reports/tr29/
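to get a feel for what the Unicode default word break rules do out of
the box, you can even play with the JDK's own java.text.BreakIterator,
no Lucene or ICU jars needed (this little demo class and its filtering
of non-word tokens are just my own illustration, not part of either
project):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Quick demo of default word segmentation using the JDK's built-in
// BreakIterator. It is not the ICU implementation, but its behavior is
// close to the UAX #29 defaults being discussed here.
public class WordBreakDemo {
    public static List<String> words(String text) {
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE;
                start = end, end = bi.next()) {
            String tok = text.substring(start, end);
            // BreakIterator also reports boundaries around spaces and
            // punctuation; keep only tokens containing a letter or digit.
            if (tok.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // '@' is a word break by default, which is exactly the kind of
        // "extra" behavior this thread is about.
        System.out.println(words("contact test@example.com now"));
    }
}
```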
as you can see, you just need to provide a BreakIterator given the
script code; you could implement this by hand in Java code, it could
use a dictionary, or whatever.
But the easiest and usually most performant approach is just to use
rules, especially since they are compiled to an efficient form for
processing. The syntax is described here:
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules
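for a sense of what those look like, here is a tiny illustrative rule
set in the RBBI syntax (mine, made up for this mail, not something you
should use as-is): it only ever emits runs of letters or runs of digits
as tokens, so '@' and '.' would always break, unlike the UAX #29
defaults:

```
# illustrative RBBI rules only, not a drop-in tokenizer config
$Letter = [[:Letter:]];
$Digit  = [[:Digit:]];

# emit runs of letters and runs of digits, each tagged with a
# rule status value you can inspect later if you want token types
$Letter+ {200};
$Digit+  {100};
```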
you compile them into a state machine with this:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/RBBIRuleCompiler.java
and you can load the serialized form (statically, or in your factory,
or whatever) with
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#getInstanceFromCompiledRules%28java.io.InputStream%29
the reason the script code is provided is that, if you are
customizing, it's pretty easy to screw some languages over with rules
that might happen to work well for another set of languages. so this
way you can provide different rules depending upon the writing system.
for example you could return special punctuation rules for western
languages when it's the Latin script, but still return the default impl
for Tibetan or something you might be less familiar with (maybe you
actually speak Tibetan, this was just an example).
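putting the pieces together, a per-script override might look roughly
like the sketch below. this is an illustrative class of my own
(LatinRulesConfig is a made-up name), written against the trunk sources
linked above; it assumes DefaultICUTokenizerConfig has a no-arg
constructor and a public getBreakIterator(int script) method, which may
differ in the version you check out:

```java
// Sketch only: needs the Lucene ICU analysis module and the ICU4J jar
// on the classpath; signatures taken from the trunk sources linked above.
import com.ibm.icu.lang.UScript;
import com.ibm.icu.text.BreakIterator;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;

public class LatinRulesConfig extends DefaultICUTokenizerConfig {
    private final BreakIterator latinRules;

    public LatinRulesConfig(BreakIterator latinRules) {
        // e.g. loaded via RuleBasedBreakIterator.getInstanceFromCompiledRules()
        this.latinRules = latinRules;
    }

    @Override
    public BreakIterator getBreakIterator(int script) {
        // custom punctuation handling for Latin-script text only;
        // every other script falls through to the UAX #29 defaults,
        // so languages you know less about are left alone.
        if (script == UScript.LATIN) {
            return (BreakIterator) latinRules.clone();
        }
        return super.getBreakIterator(script);
    }
}
```

you would then hand an instance of this to the ICUTokenizer constructor
from your own tokenizer factory, which is the "extra .jar without
touching lucene/solr" route mentioned earlier in the thread.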