
Posted to java-user@lucene.apache.org by Daniel Bigham <da...@wolfram.com> on 2016/06/01 16:56:59 UTC

analyzers-common VS analyzers-icu

Hi,

I recently set up my code to choose the appropriate analyzer from 
analyzers-common depending on the language of the user's index/field.   
I then extended the existing source code so that, for any language, 
things like stemming and case sensitivity can be turned on or off.

Today I discovered analyzers-icu, and I don't understand how 
analyzers-common and analyzers-icu relate to each other.

Are they drop-in replacements for each other?  Are there features in one 
that aren't available in the other?  What are the pros and cons of using 
one or the other?

In a nutshell, the features I care about are:

- The ability to specify a language and have tokenization performed 
according to that language
- Obviously the more languages supported the better
- The ability to turn on/off stemming for any language (implemented 
myself for analyzers-common)
- The ability to turn on/off case sensitivity for any language 
(implemented myself for analyzers-common)

Thanks,
Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: analyzers-common VS analyzers-icu

Posted by Daniel Bigham <da...@wolfram.com>.

Any other replies to this? Timothy's response was somewhat helpful, but it hasn't answered in an authoritative way what the current status of these two different "forks" of language analyzers is. Surely there is some history here, and some high-level status for each of them? (Perhaps I should look at git and try to figure out who the main developers have been?)

----- On Jun 1, 2016, at 3:04 PM, Allison, Timothy B. <ta...@mitre.org> wrote: 

> That package has an ICU tokenizer and the ICUFoldingFilter.

> The ICUFoldingFilter does advanced (well, Unicode compliant) case
> folding/lowercasing/normalization and is critical for non-ASCII languages. You
> can use that in place of the ASCIIFoldingFilter and the LowerCaseFilter, and it
> should be far more robust on Arabic script/CJK languages.

> I can't speak for how swapping in the ICUFoldingFilter may adversely affect your
> analysis chain (there may be surprises with stemmers or stop lists that were
> designed without it), but, generally, that's a really important filter.

> I haven't looked deeply into the diffs between the StandardTokenizer and the
> ICUTokenizer and can't speak to that.


RE: analyzers-common VS analyzers-icu

Posted by "Allison, Timothy B." <ta...@mitre.org>.

That package has an ICU tokenizer and the ICUFoldingFilter.  

The ICUFoldingFilter does advanced (well, Unicode compliant) case folding/lowercasing/normalization and is critical for non-ASCII languages.  You can use that in place of the ASCIIFoldingFilter and the LowerCaseFilter, and it should be far more robust on Arabic script/CJK languages.

I can't speak for how swapping in the ICUFoldingFilter may adversely affect your analysis chain (there may be surprises with stemmers or stop lists that were designed without it), but, generally, that's a really important filter.

I haven't looked deeply into the diffs between the StandardTokenizer and the ICUTokenizer and can't speak to that.
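
To give a rough feel for why Unicode-aware folding matters, here is a minimal sketch that uses only java.text.Normalizer from the JDK. It is not the ICUFoldingFilter itself, which applies a much broader set of ICU foldings, and the class and method names (FoldingSketch, asciiLower, unicodeFold) are made up for this illustration. It compares an ASCII-only lowercase with a simple "decompose, strip combining marks, lowercase" step.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration only; this is NOT Lucene's ICUFoldingFilter.
public class FoldingSketch {

    // ASCII-only lowercasing: every character outside A-Z is left untouched.
    public static String asciiLower(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c >= 'A' && c <= 'Z' ? (char) (c + 32) : c);
        }
        return sb.toString();
    }

    // Rough Unicode-aware folding: compatibility-decompose (NFKD),
    // drop combining marks, then lowercase with a fixed locale.
    public static String unicodeFold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Accented Latin text, fullwidth Latin letters, and an accented capital.
        String[] samples = {
            "R\u00e9sum\u00e9",
            "\uff26\uff35\uff2c\uff2c",
            "\u00c9COLE"
        };
        for (String sample : samples) {
            System.out.println(sample
                + " | asciiLower=" + asciiLower(sample)
                + " | unicodeFold=" + unicodeFold(sample));
        }
    }
}
```

In an actual Lucene analysis chain the equivalent step would be wrapping the token stream in org.apache.lucene.analysis.icu.ICUFoldingFilter rather than hand-rolling anything like the above.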
