Posted to java-user@lucene.apache.org by Vasiliki Gkouta <vg...@csd.auth.gr> on 2011/03/14 00:23:56 UTC
Analyzer enquiry
Hello everybody,
I have an enquiry about StandardAnalyzer. Can I use it for languages
other than English? I pass the right list of stop words at
initialization. Is there anything else inside the class that is set to
English by default?
I've also found the analyzers for other languages, but they seem to be
deprecated. Moreover, I use English and other languages together in my
project, so I would like to ask whether there is a way to use either the
same analyzer class for all of them, or analyzers with the same
functionality for all the languages. Thanks in advance!
Best regards,
Vicky
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Analyzer enquiry
Posted by Vasiliki Gkouta <vg...@csd.auth.gr>.
Thank you for your help!
Best Regards,
Vicky
Quoting Erick Erickson <er...@gmail.com>:
> Nope, that should do it.
>
> Best
> Erick
Re: Analyzer enquiry
Posted by Erick Erickson <er...@gmail.com>.
Nope, that should do it.
Best
Erick
On Mon, Mar 14, 2011 at 9:35 AM, Vasiliki Gkouta <vg...@csd.auth.gr> wrote:
> Sorry for the confusion. I have two analyzers (both StandardAnalyzer) and use
> no stemmers. To one analyzer I passed a German stop-word set in the
> constructor, and to the other I passed an English stop-word set. My
> question was whether I have to call any other method on the German analyzer
> for it to be correct.
>
> Thank you.
Re: Analyzer enquiry
Posted by Vasiliki Gkouta <vg...@csd.auth.gr>.
Sorry for the confusion. I have two analyzers (both StandardAnalyzer) and
use no stemmers. To one analyzer I passed a German stop-word set in the
constructor, and to the other I passed an English stop-word set. My
question was whether I have to call any other method on the German
analyzer for it to be correct.
Thank you.
Quoting Erick Erickson <er...@gmail.com>:
> I don't understand what you're saying here. If you put a stemmer in the
> constructor, you *are* using it. If you don't specify any stemmer at all, you
> still have to define different analyzers to use different stop word lists.
>
> Can you restate your question?
>
> Best
> Erick
Re: Analyzer enquiry
Posted by Erick Erickson <er...@gmail.com>.
I don't understand what you're saying here. If you put a stemmer in the
constructor, you *are* using it. If you don't specify any stemmer at all, you
still have to define different analyzers to use different stop word lists.
Can you restate your question?
Best
Erick
On Mon, Mar 14, 2011 at 8:21 AM, Vasiliki Gkouta <vg...@csd.auth.gr> wrote:
> Thanks a lot for your help Erick! About the fields you mentioned: If I don't
> use stemmers, except for the constructor argument related to the stop words,
> is there anything else that I have to modify?
>
> Thanks,
> Vicky
Re: Analyzer enquiry
Posted by Vasiliki Gkouta <vg...@csd.auth.gr>.
Thanks a lot for your help Erick! About the fields you mentioned: If I
don't use stemmers, except for the constructor argument related to the
stop words, is there anything else that I have to modify?
Thanks,
Vicky
Quoting Erick Erickson <er...@gmail.com>:
> StandardAnalyzer works well for most European languages. The problem will
> be stemming. Applying stemming via English rules to non-English languages
> produces...er...interesting results.
>
> You can go ahead and create language-specific fields for each language and
> use StandardAnalyzer with the appropriate stopwords and stemming with each;
> this is a common approach. The Snowball stemmer takes a language
> parameter...
>
> You need to use specific analyzers for Chinese, Japanese, and Korean (CJK)
> documents, though.
>
> Hope that helps
> Erick
Re: Analyzer enquiry
Posted by Erick Erickson <er...@gmail.com>.
StandardAnalyzer works well for most European languages. The problem will
be stemming. Applying stemming via English rules to non-English languages
produces...er...interesting results.
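To make the stemming point concrete, here is a minimal sketch in plain Java. It does not use Lucene, and the suffix rules are a toy stand-in, not the Porter or Snowball algorithm; the class and method names are made up for this illustration. It only shows how rules written with English in mind can cut German words in the wrong place.

```java
import java.util.List;
import java.util.Locale;

final class NaiveEnglishStemmerDemo {
    // Toy English-only rules (NOT Porter or Snowball): strip the first
    // matching suffix, as long as at least three characters remain.
    static String stemEnglish(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : List.of("ing", "ed", "s")) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Two English words, then two German words ("darling", "house").
        for (String w : List.of("walking", "walked", "Liebling", "Haus")) {
            System.out.println(w + " -> " + stemEnglish(w));
        }
    }
}
```

The English inputs both reduce to "walk", which is the intended effect. The German inputs become "liebl" and "hau", neither of which is the stem a German-aware stemmer would aim for.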
You can go ahead and create language-specific fields for each language and
use StandardAnalyzer with the appropriate stopwords and stemming with each;
this is a common approach. The Snowball stemmer takes a language parameter...
You need to use specific analyzers for Chinese, Japanese, and Korean (CJK)
documents, though.
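The language-specific-fields idea can be sketched as follows. This is a plain-Java model of the approach, not Lucene code: the field names ("body_en", "body_de"), the tiny stop-word lists, and the analyze helper are all assumptions made for this example. In Lucene itself the same routing would be done with one analyzer per field, each constructed with its own stop-word set (PerFieldAnalyzerWrapper is one way to do that; check the Javadoc for your version).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class PerLanguageFields {
    // Hypothetical field names and deliberately tiny stop-word lists,
    // for illustration only.
    static final Map<String, Set<String>> STOP_WORDS_BY_FIELD = Map.of(
            "body_en", Set.of("the", "and", "a", "of"),
            "body_de", Set.of("der", "die", "das", "und"));

    // Lowercase, split on non-letter characters, and drop the stop words
    // configured for the given field. No stemming is applied.
    static List<String> analyze(String field, String text) {
        Set<String> stopWords = STOP_WORDS_BY_FIELD.getOrDefault(field, Set.of());
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("body_en", "The cat and the dog"));
        System.out.println(analyze("body_de", "Der Hund und die Katze"));
        // Same German text through the English field: no stop words removed.
        System.out.println(analyze("body_en", "Der Hund und die Katze"));
    }
}
```

Tokenization and lowercasing are identical for both fields; only the stop-word set differs. That matches the situation discussed in this thread, where the stop-word constructor argument is the only per-language setting when no stemmer is used.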
Hope that helps
Erick
On Sun, Mar 13, 2011 at 7:23 PM, Vasiliki Gkouta <vg...@csd.auth.gr> wrote:
> Hello everybody,
>
> I have an enquiry about StandardAnalyzer. Can I use it for languages other
> than English? I pass the right list of stop words at initialization.
> Is there anything else inside the class that is set to English by default?
> I've also found the analyzers for other languages, but they seem to be
> deprecated. Moreover, I use English and other languages together in my
> project, so I would like to ask whether there is a way to use either the
> same analyzer class for all of them, or analyzers with the same
> functionality for all the languages. Thanks in advance!
>
> Best regards,
> Vicky