Posted to java-user@lucene.apache.org by Vasiliki Gkouta <vg...@csd.auth.gr> on 2011/03/14 00:23:56 UTC

Analyzer enquiry

Hello everybody,

I have an enquiry about StandardAnalyzer. Can I use it for languages
other than English? I pass the appropriate list of stop words at
initialization. Is there anything else inside the class that defaults
to English?
I've also found the analyzers for other languages, but they seem to be
deprecated. Moreover, I use English and other languages together in my
project, so I would like to ask whether there is a way to use either
the same analyzer class for all of them, or analyzers with the same
functionality for each language. Thanks in advance!

Best regards,
Vicky



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Analyzer enquiry

Posted by Vasiliki Gkouta <vg...@csd.auth.gr>.

Thank you for your help!

Best Regards,
Vicky


Re: Analyzer enquiry

Posted by Erick Erickson <er...@gmail.com>.

Nope, that should do it.

Best
Erick


Re: Analyzer enquiry

Posted by Vasiliki Gkouta <vg...@csd.auth.gr>.

Sorry for the confusion. I have two analyzers (both StandardAnalyzer
instances) and use no stemmers. To one analyzer I passed a German stop
word set in the constructor, and to the other an English stop word
set. My question was whether I have to call any other method on the
German analyzer for it to be correct.

Thank you.
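
To make the setup described above concrete, here is a minimal plain-Java
sketch. It is not Lucene code: the class name StopWordSketch, the analyze
method, and the tiny stop word sets are all made up for illustration. It
only shows the idea that, with no stemmer involved, the stop word set is
the one language-specific input, so the same pipeline can be reused with
a German set and an English set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java stand-in for an analyzer without stemming (not Lucene's API). */
class StopWordSketch {

    // Deliberately tiny, illustrative stop word sets; real lists are much longer.
    static final Set<String> ENGLISH_STOPS = Set.of("the", "and", "a", "of", "is");
    static final Set<String> GERMAN_STOPS = Set.of("der", "die", "das", "und", "ist");

    /** Splits on non-letter/digit characters, lowercases, and drops stop words. */
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading separator yields one empty piece
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same pipeline, different stop word set per language.
        System.out.println(analyze("The house and the garden", ENGLISH_STOPS));
        System.out.println(analyze("Das Haus und der Garten", GERMAN_STOPS));
    }
}
```

In this sketch nothing besides the stop word argument changes between the
two calls, which mirrors the answer given in the thread for the case where
no stemming is used.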



Re: Analyzer enquiry

Posted by Erick Erickson <er...@gmail.com>.

I don't understand what you're saying here. If you put a stemmer in the
constructor, you *are* using it. If you don't specify any stemmer at all, you
still have to define different analyzers to use different stop word lists.

Can you restate your question?

Best
Erick


Re: Analyzer enquiry

Posted by Vasiliki Gkouta <vg...@csd.auth.gr>.

Thanks a lot for your help, Erick! About the fields you mentioned: if I
don't use stemmers, is there anything else that I have to modify apart
from the constructor argument for the stop words?

Thanks,
Vicky



Re: Analyzer enquiry

Posted by Erick Erickson <er...@gmail.com>.

StandardAnalyzer works well for most European languages. The problem will
be stemming. Applying stemming via English rules to non-English languages
produces...er...interesting results.

You can go ahead and create language-specific fields for each language and
use StandardAnalyzer with the appropriate stopwords and stemming for each;
this is a common approach. The Snowball stemmer takes a language parameter...

You need to use specific analyzers for Chinese, Japanese, and Korean (CJK)
documents, though.

Hope that helps
Erick
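
The language-specific-fields approach can be sketched in plain Java as
follows. This is an illustration only, not Lucene code: the field names
body_en and body_de, the class PerLanguageFieldSketch, and the short stop
word lists are assumptions made for the example. In Lucene itself, the
class that maps field names to analyzers is PerFieldAnalyzerWrapper; the
sketch just mimics that routing idea without stemming.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Plain-Java sketch: one field per language, each with its own stop words. */
class PerLanguageFieldSketch {

    // Hypothetical field names and deliberately tiny stop word lists.
    static final Map<String, Set<String>> STOPS_BY_FIELD = Map.of(
            "body_en", Set.of("the", "and", "of"),
            "body_de", Set.of("der", "die", "das", "und"));

    /** Looks up the field's stop words, then splits, lowercases, and filters. */
    static List<String> analyze(String field, String text) {
        // Unknown fields fall back to an empty stop word set.
        Set<String> stops = STOPS_BY_FIELD.getOrDefault(field, Set.of());
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !stops.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("body_en", "The history of the city"));
        System.out.println(analyze("body_de", "Die Geschichte der Stadt"));
    }
}
```

The point of routing by field is that indexing and querying can both pick
the analysis for a language from the field name alone, so English and
German text never share a stop word list (or, if stemming is added, a
stemmer).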
