Posted to java-user@lucene.apache.org by saisantoshi <sa...@gmail.com> on 2013/01/08 20:30:25 UTC

Is StandardAnalyzer good enough for multi languages...

Does Lucene StandardAnalyzer work for all languages for tokenizing before
indexing (since we are using Java, I think the content is converted to UTF-8
before tokenizing/indexing)? Or do we need to use special analyzers for each
language? In this case, if a document has mixed content (English +
Japanese), what analyzer should we use, and how can we figure it out
dynamically before indexing?

Also, while searching, if the query text contains both English and
Japanese, how does this work? Is there any criterion for choosing the analyzers?

Thanks,
Sai





RE: Is StandardAnalyzer good enough for multi languages...

Posted by Paul Hill <pa...@metajure.com>.
The ICU project ( http://site.icu-project.org/ ) has analyzers for Lucene, and they have been ported to ElasticSearch.  Maybe those integrate better.

As for tokenization it does not do, I would think an extra tokenizer in your chain would be just the thing.
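
For example, a rough, untested sketch of that idea against the 4.x API (the
ICU filter lives in the lucene-analyzers-icu module; the class name and version
constant below are just illustrative):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.icu.ICUFoldingFilter;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // A StandardAnalyzer-like chain with one extra ICU step bolted on.
    public final class StandardPlusIcuAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
        TokenStream stream = new StandardFilter(Version.LUCENE_40, source);
        // ICUFoldingFilter does Unicode normalization plus case folding,
        // so a separate LowerCaseFilter is not needed here.
        stream = new ICUFoldingFilter(stream);
        return new TokenStreamComponents(source, stream);
      }
    }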

-Paul

> -----Original Message-----
> From: Trejkaz [mailto:trejkaz@trypticon.org]
> Sent: Tuesday, January 08, 2013 3:44 PM
> To: java-user@lucene.apache.org
> Subject: Re: Is StandardAnalyzer good enough for multi languages...
> 
> On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi <sa...@gmail.com> wrote:
> > Does Lucene StandardAnalyzer work for all languages for tokenizing
> > before indexing (since we are using Java, I think the content is
> > converted to UTF-8 before tokenizing/indexing)?
> 
> No. There are multiple cases where it chooses not to break something which it should break. Some of
> these cases even result in undesirable behaviour for English, so I would be surprised if there were even a
> single language which it handles acceptably.



Re: Is StandardAnalyzer good enough for multi languages...

Posted by saisantoshi <sa...@gmail.com>.
Thanks for all the responses. From the above, it sounds like there are two
options.

1. Use ICUTokenizer (is it in Lucene 4.0 or 4.1?). If it is only in 4.1, then we
cannot use it at this time, as 4.1 has not been released yet.

2. Write a custom analyzer by extending StandardAnalyzer and adding filters
for additional languages.

The problem that we are facing currently is described in detail at: 

http://lucene.472066.n3.nabble.com/Lucene-support-for-multi-byte-characters-2-4-0-version-td4031654.html
Just to summarize it, we are facing issues tokenizing some Japanese keyword
characters (when uploading documents, we have keywords that people can type in
any language), and as a result, searching with those specific keywords is not
working with StandardAnalyzer (version 2.4.0).

Can you suggest any filter for this that we could integrate into StandardAnalyzer?

Thanks,
Sai.





RE: Is StandardAnalyzer good enough for multi languages...

Posted by Paul Hill <pa...@metajure.com>.
There is often the possibility of putting another tokenizer in the chain to create a variant analyzer.  This is NOT very hard at all in either Lucene or ElasticSearch.
Extra tokenizers can often be used to tweak the overall processing, adding a late tokenization step to overcome an overlooked break (break on colon would be a simple example).  Adding a tokenizer before the others can change a token that seems incorrectly processed into one that is handled the way you like.
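
To make the colon example concrete, here is a rough, untested sketch (4.x-style
API; the class name is just illustrative) that maps ':' and '_' to a space
before StandardTokenizer ever sees the text:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.charfilter.MappingCharFilter;
    import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public final class BreakOnColonAnalyzer extends Analyzer {
      private static final NormalizeCharMap MAP;
      static {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add(":", " ");   // force a break on colon
        builder.add("_", " ");   // ...and on underscore
        MAP = builder.build();
      }

      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        // Runs before the tokenizer, so StandardTokenizer just sees a space.
        return new MappingCharFilter(MAP, reader);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_40, source);
        return new TokenStreamComponents(source, stream);
      }
    }

The char filter keeps track of offset corrections, so highlighting should still
line up with the original text.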

Trejkaz, I haven't tried to use ICU yet, but from what I understand, I think you'll find that ICU is more in agreement with your views and embraces the idea of refining the tokenization etc. as needed, rather than relying on the curious (and often flawed) choices of some design committee somewhere.

 [ICU]
> -----Original Message-----
> ... no specialisation for straight Roman script, but I guess it could
> always be added.

That would be one of the main points of the whole ICU infrastructure.

-Paul 



Re: Is StandardAnalyzer good enough for multi languages...

Posted by Trejkaz <tr...@trypticon.org>.
On Wed, Jan 9, 2013 at 5:25 PM, Steve Rowe <sa...@gmail.com> wrote:
> Dude.  Go look.  It allows for per-script specialization, with (non-UAX#29) specializations by default for Thai, Lao, Myanmar and Hebrew.  See DefaultICUTokenizerConfig.  It's filled with exactly the opposite of what you were describing.

I guess that's a reasonable start. Still has no specialisation for
straight Roman script, but I guess it could always be added.

TX



Re: Is StandardAnalyzer good enough for multi languages...

Posted by Steve Rowe <sa...@gmail.com>.
Dude.  Go look.  It allows for per-script specialization, with (non-UAX#29) specializations by default for Thai, Lao, Myanmar and Hebrew.  See DefaultICUTokenizerConfig.  It's filled with exactly the opposite of what you were describing.

ICUTokenizerFactory's customizability has been enhanced in to-be-released Lucene/Solr 4.1: <https://issues.apache.org/jira/browse/SOLR-4123> - you can provide per-script RuleBasedBreakIterator specification files at runtime. 
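
For the non-factory route, a minimal untested sketch (the class name is just
illustrative) of wiring the tokenizer into an analyzer:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

    public final class IcuAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // The single-argument constructor uses DefaultICUTokenizerConfig,
        // which chooses a per-script break iterator; pass your own
        // ICUTokenizerConfig to customize per-script behaviour.
        Tokenizer source = new ICUTokenizer(reader);
        return new TokenStreamComponents(source);
      }
    }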

On Jan 9, 2013, at 12:12 AM, Trejkaz <tr...@trypticon.org> wrote:

> On Wed, Jan 9, 2013 at 10:57 AM, Steve Rowe <sa...@gmail.com> wrote:
>> Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be of interest to you, along with the token filters in that same module. - Steve
> 
> ICUTokenizer sounds like it's implementing UAX #29, which is exactly
> the standard filled with all the issues I was describing. Unless it
> does more than that, I would recommend against using that also.
> 




Re: Is StandardAnalyzer good enough for multi languages...

Posted by Trejkaz <tr...@trypticon.org>.
On Wed, Jan 9, 2013 at 10:57 AM, Steve Rowe <sa...@gmail.com> wrote:
> Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be of interest to you, along with the token filters in that same module. - Steve

ICUTokenizer sounds like it's implementing UAX #29, which is exactly
the standard filled with all the issues I was describing. Unless it
does more than that, I would recommend against using that also.

TX



Re: Is StandardAnalyzer good enough for multi languages...

Posted by Steve Rowe <sa...@gmail.com>.
Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be of interest to you, along with the token filters in that same module. - Steve
 
On Jan 8, 2013, at 6:43 PM, Trejkaz <tr...@trypticon.org> wrote:

> On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi <sa...@gmail.com> wrote:
>> Does Lucene StandardAnalyzer work for all languages for tokenizing before
>> indexing (since we are using Java, I think the content is converted to UTF-8
>> before tokenizing/indexing)?
> 
> No. There are multiple cases where it chooses not to break something
> which it should break. Some of these cases even result in undesirable
> behaviour for English, so I would be surprised if there were even a
> single language which it handles acceptably.
> [...]




Re: Is StandardAnalyzer good enough for multi languages...

Posted by Trejkaz <tr...@trypticon.org>.
On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi <sa...@gmail.com> wrote:
> Does Lucene StandardAnalyzer work for all languages for tokenizing before
> indexing (since we are using Java, I think the content is converted to UTF-8
> before tokenizing/indexing)?

No. There are multiple cases where it chooses not to break something
which it should break. Some of these cases even result in undesirable
behaviour for English, so I would be surprised if there were even a
single language which it handles acceptably.

It does follow "Unicode standards" for how to tokenise text, but these
standards were written by people who didn't quite know what they were
doing so it's really just passing the buck. I don't think Lucene
should have chosen to follow that standard in the first place, because
it rarely (if ever) gives acceptable results.

The worst examples for English, at least for us, were that it does not
break on colon (:) or underscore (_).

Colon was explained by some languages using it like an apostrophe.
Personally I think you should break on an apostrophe as well, so I'm
not really happy with this reasoning, but OK.

Underscore was completely baffling to me so I asked someone at Unicode
about it. They explained that it was because it was "used by
programmers to separate words in identifiers". This explanation is
exactly as stupid as it sounds and I hope they will realise their
stupidity some day.

> or do we need to use special analyzers for each language?

I do think that StandardTokenizer at least can form a good base for an
analyser. You just have to add a ton of filters to fix each additional
case you find where people don't like it. For instance, it returns
runs of Katakana as a single token, but if you index it that way, people
won't find what they are searching for, so you make a filter to
split the run back out into multiple tokens.
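
A very rough, untested sketch of that kind of filter (4.x-style API, naive
per-character splitting, ignoring halfwidth forms, surrogates and so on):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    // Splits a multi-character Katakana token back into one token per character.
    public final class KatakanaSplitFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
      private final PositionIncrementAttribute posIncAtt =
          addAttribute(PositionIncrementAttribute.class);

      private char[] pending;    // remaining characters of the buffered run
      private int pendingIndex;
      private int pendingStart;  // start offset of the original token

      public KatakanaSplitFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (pending != null) {
          emitPending();
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        if (termAtt.length() > 1 && isAllKatakana()) {
          // Buffer the run, then hand it back one character at a time.
          pending = new char[termAtt.length()];
          System.arraycopy(termAtt.buffer(), 0, pending, 0, termAtt.length());
          pendingStart = offsetAtt.startOffset();
          pendingIndex = 0;
          emitPending();
        }
        return true;
      }

      private void emitPending() {
        if (pendingIndex > 0) {
          posIncAtt.setPositionIncrement(1);  // later characters advance one position each
        }
        termAtt.setEmpty().append(pending[pendingIndex]);
        offsetAtt.setOffset(pendingStart + pendingIndex, pendingStart + pendingIndex + 1);
        pendingIndex++;
        if (pendingIndex == pending.length) {
          pending = null;
        }
      }

      private boolean isAllKatakana() {
        for (int i = 0; i < termAtt.length(); i++) {
          if (Character.UnicodeBlock.of(termAtt.charAt(i)) != Character.UnicodeBlock.KATAKANA) {
            return false;
          }
        }
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pending = null;
      }
    }

Whether you actually want single characters, bigrams or proper morphological
splitting is a separate question; this is just the shape of the thing.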

It would help if there were a single, core-maintained analyser for
"StandardAnalyzer with all the things people hate fixed"... but I
don't know if anyone is interested in maintaining it.

> In this case, if a document has mixed content (English +
> Japanese), what analyzer should we use, and how can we figure it out
> dynamically before indexing?

Some language detection libraries will give you back the fragments in
the text and tell you which language is used for each fragment, so
that is totally a viable option as well. You'd just make your own
analyser which concatenates the results.

> Also, while searching, if the query text contains both English and
> Japanese, how does this work? Is there any criterion for choosing the analyzers?

I guess you could either ask the user what language they're searching
in or look at what characters are in their query and decide which
language(s) it matches and build the query from there. It might match
multiple...
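
A crude, untested way to do that second option with plain Java, just checking
which scripts appear in the query string (the class name and labels are only
illustrative):

    import java.util.HashSet;
    import java.util.Set;

    // Naive script sniffing: report which broad script groups occur in a query
    // so the caller can decide which analyzer(s) to run it through.
    public final class QueryScriptSniffer {
      public static Set<String> scriptsIn(String query) {
        Set<String> scripts = new HashSet<String>();
        for (int i = 0; i < query.length(); i++) {
          char c = query.charAt(i);
          Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
          if (block == Character.UnicodeBlock.HIRAGANA
              || block == Character.UnicodeBlock.KATAKANA
              || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
            scripts.add("japanese");
          } else if (Character.isLetter(c)) {
            scripts.add("other");
          }
        }
        return scripts;
      }
    }

If both come back, you could analyse the text with each candidate analyzer and
OR the resulting clauses together in a BooleanQuery.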

TX
