Posted to java-user@lucene.apache.org by Robert Muir <rc...@gmail.com> on 2009/06/15 19:14:19 UTC

Re: Lucene and multi-lingual Unicode - advice needed

Hi,

(Since this is an issue you brought up on the Compass forums)

I wonder what stage you are at in the development process?
Have you considered SolR, or does Compass provide some other
functionality that you need?

The reason I say this is that the easiest solution might be to use
a nightly SolR build for your application.

I'm not personally biased one way or the other toward any particular
framework, but recently there have been some improvements added to SolR
so that the default type 'text' is pretty good for multilingual
processing.

In fact, I hope that in the future this will be improved in Lucene itself,
so that your decision is really based on other application needs...

On Mon, Jun 15, 2009 at 1:10 PM, OBender Hotmail<os...@hotmail.com> wrote:
> Hi All!
>
>
>
> I'm new to Lucene, so forgive me if this question was asked before.
>
> I have a database with records in the same table in many different languages
> (up to 70); it includes all W-European, Arabic, Eastern, CJK, Cyrillic, etc.,
> you name it.
> I've looked at what people say about Lucene, and it looks like the standard
> analyzers should do fine with most Unicode languages for the most part, but
> there are quite a few exceptions.
> Here is a recently updated Lucene Jira issue:
> https://issues.apache.org/jira/browse/LUCENE-1488
>
> My question is, what would be the safest bet for me in terms of
> analyzers/tokenizers?
> Do I really have to write my own for the languages that are
> not supported?
> Did anyone already solve a problem similar to mine? I'm sure someone
> already did :)
>
> And yes, I looked at the Lucene sandbox analyzers. It just adds more
> confusion. For example, why are there analyzers for DE and FR? Wouldn't the
> standard analyzer (which is Unicode compliant, as I understood) deal with EU
> languages just fine?
>
> Thanks in advance for any advice :)
>
>



-- 
Robert Muir
rcmuir@gmail.com


RE: Lucene and multi-lingual Unicode - advice needed

Posted by OBender Hotmail <os...@hotmail.com>.
Yes, thanks! I'll start with a simple one as you described and test it on the languages we have at the moment.


Re: Lucene and multi-lingual Unicode - advice needed

Posted by Robert Muir <rc...@gmail.com>.
OK, well at first I thought you must be playing a joke on me or something...

Maybe you want to create a Lucene analyzer that mimics Solr's defaults.

Search the mail archives for this recent thread, where KK posted his code:
Re: How to support stemming and case folding for english content mixed
with non-english content?

Then again, maybe the sample code I gave you (whitespace + lowercase)
is good enough. By the time the company in question manages to get its
Chamorro, Cornish, Blackfoot, and Pashto testers together to evaluate
the search, you will be retired :)
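
To give a rough idea of the kind of chain that thread is about, here is a
sketch of my own (not KK's actual code). The class name is made up, it only
uses plain Lucene 2.4 core classes, and whether running an English stemmer
over your non-English tokens is acceptable is exactly the thing you would
have to test:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MixedContentAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // StandardTokenizer copes with most scripts; CJK text comes out as very
    // short tokens, so check that case separately.
    TokenStream ts = new StandardTokenizer(reader);
    ts = new LowerCaseFilter(ts);   // case folding
    ts = new PorterStemFilter(ts);  // English stemming; it will also alter some non-English tokens
    return ts;
  }
}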


-- 
Robert Muir
rcmuir@gmail.com


RE: Lucene and multi-lingual Unicode - advice needed

Posted by OBender Hotmail <os...@hotmail.com>.
That's the thing, there is no actual requirement.
I've been presented with all the languages that the company theoretically provides.
My guess is that what I'm going to end up with is all the Western languages, a good share of the Arabic family, a complete set of the Eastern and Eastern European ones, and of course CJK.


Re: Lucene and multi-lingual Unicode - advice needed

Posted by Robert Muir <rc...@gmail.com>.
Really, you have a requirement that the system should search written Cornish?

I think you might have larger problems!

-- 
Robert Muir
rcmuir@gmail.com


RE: Lucene and multi-lingual Unicode - advice needed

Posted by OBender Hotmail <os...@hotmail.com>.
Here is the list of possible languages. Don't laugh :) I know it is almost every language in the world, but it is a true requirement. Well, the actual number will be closer to 70 than 100, but I still don't really know which ones from the list below will end up in the DB.

-------
Afrikaans Albanian Arabic Armenian Austrian Aymara Azerbaijani
Basque Belorussian Bemba Bengali Blackfoot Bosnian Breton Bulgarian Canadian French Catalan Cebuano Chamorro Chinese Chechen Cornish Croatian Czech
Danish Dutch
Ecuadorian Quechua English English-Portuguese Esperanto Estonian
Faroese Farsi Finnish Flemish French Frisian
Galician Georgian German Greek Guarani
Haitian Creole Hausa Hawaiian Hebrew Hindi Hungarian Icelandic Indonesian Inuktitut Irish Italian
Japanese
Kazakh Kongo Korean
Latin Latvian Lithuanian Luganda Luxembourgish
Macedonian Malagasy Malay Maori Maya Mohawk Mongolian
Nahuatl Norwegian
Papago Pashto Pidgin English Polish Portuguese (European)
Portuguese (Brazilian) Provençal
Quechua
Romanian Romansch Romany Ruanda Russian
Samoan Scottish Sepedi Serbian Shona Sicilian Slovak Slovene Somali Sorbian Sotho Spanish Swahili Swazi Swedish
Tagalog Tahitian Thai Tongan Tswana Turkish Turkmen Tuvan
Ukrainian Urdu Uzbek Vietnamese
Welsh Wolof
Xhosa
Yiddish Yoruba
Zulu


Re: Lucene and multi-lingual Unicode - advice needed

Posted by Robert Muir <rc...@gmail.com>.
It's not too bad; here is a simple one that only breaks words on
whitespace and lowercases:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class Example extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader); // split on whitespace only
    ts = new LowerCaseFilter(ts);                     // then lowercase each token
    return ts;
  }
}
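
If you want to eyeball what this does to text in each of your languages
before settling on it, a quick token dump is enough. This is just a sketch
against the Lucene 2.4 API; the field name "body" and the sample string are
placeholders:

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class DumpTokens {
  public static void main(String[] args) throws Exception {
    String sample = "some text in one of your languages";
    TokenStream ts = new Example().tokenStream("body", new StringReader(sample));
    Token token = new Token();
    // print each token the analyzer produces, one per line
    while ((token = ts.next(token)) != null) {
      System.out.println(token.term());
    }
  }
}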

Can you give a better idea as to what languages you have and what your
search requirements are (accent marks, punctuation, etc.)?

-- 
Robert Muir
rcmuir@gmail.com


RE: Lucene and multi-lingual Unicode - advice needed

Posted by OBender Hotmail <os...@hotmail.com>.
I've looked over SolR quickly; it is a bit too heavy for my project.
So what is required (at a minimum) to build an analyzer? The sandbox has a few of them, varying in complexity.


Re: Lucene and multi-lingual Unicode - advice needed

Posted by Robert Muir <rc...@gmail.com>.
Well, just reply back if SolR is inappropriate for your needs.

In that case, you will need to build a custom analyzer (it's not too
bad), so that you can use Compass.

-- 
Robert Muir
rcmuir@gmail.com


RE: Lucene and multi-lingual Unicode - advice needed

Posted by OBender Hotmail <os...@hotmail.com>.
Hi,

My goal is to find a framework that encapsulates as much of the low-level indexing/search technology as possible and integrates nicely with Spring.
It looked like Compass was/is a good encapsulation of that functionality. I'll take a look at SolR though; thanks for the pointer.


Re: Lucene and multi-lingual Unicode - advice needed

Posted by Robert Muir <rc...@gmail.com>.
Michael, here is the Solr change:
http://issues.apache.org/jira/browse/SOLR-1078

Here is the issue to import it into Lucene:
http://issues.apache.org/jira/browse/LUCENE-1377


-- 
Robert Muir
rcmuir@gmail.com


Re: Lucene and multi-lingual Unicode - advice needed

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Mon, Jun 15, 2009 at 1:14 PM, Robert Muir<rc...@gmail.com> wrote:

> I'm not personally biased one way or the other toward any particular
> framework, but recently there have been some improvements added to SolR
> so that the default type 'text' is pretty good for multilingual
> processing.

Robert, can you describe what these changes are/were, and how one could
do the same thing as Solr's "text" field directly with Lucene?

Mike
