Posted to solr-user@lucene.apache.org by Poornima Jay <po...@rocketmail.com> on 2014/07/10 09:26:45 UTC
Korean Tokenizer in solr
Hi,
Has anyone tried to implement Korean language support in Solr 3.6.1? I defined the field type as below in my schema file, but the field type is not working.
<fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000">
  <analyzer type="index">
    <tokenizer class="solr.KoreanTokenizerFactory"/>
    <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KoreanTokenizerFactory"/>
    <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
  </analyzer>
</fieldType>
Error: Caused by: org.apache.solr.common.SolrException: Unknown fieldtype 'text_kr' specified on field product_name_kr
Regards,
Poornima
Re: Korean Tokenizer in solr
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
What happens if you create a new collection with the absolute minimum in it
and then add the definition? Start from something like:
https://github.com/arafalov/simplest-solr-config
Also, is there a long exception earlier in the log? It may have more clues.
Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
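One way to follow the suggestion above is to strip the schema down to a skeleton and then re-add the field type under test. The sketch below is an assumption about what such a minimal Solr 4.x schema.xml could look like, not the contents of the linked repository. The names `minimal`, `id` and `text_basic` are illustrative; `product_name_kr` is taken from the error message in the original post. Field types sit inside `<types>` and fields inside `<fields>`, which is the layout Solr releases before 4.8 expect.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal schema sketch for isolating a field type problem (illustrative names) -->
<schema name="minimal" version="1.5">
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <!-- Start with components that ship with Solr, then swap in the chain under test -->
    <fieldType name="text_basic" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="product_name_kr" type="text_basic" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
```

If the core loads with a skeleton like this, the failing definition can be added back one element at a time while watching the log for the first exception.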
On Mon, Jul 14, 2014 at 2:15 PM, Poornima Jay
<po...@rocketmail.com> wrote:
> Yes. Below is the field type I defined:
>
> <fieldType name="text_match_phrase_cjk" class="solr.TextField" positionIncrementGap="100">
> <analyzer type ="index">
> <tokenizer class="solr.ICUTokenizerFactory"/>
> <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
> </analyzer>
> <analyzer type ="query">
> <tokenizer class="solr.ICUTokenizerFactory"/>
> <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
> </analyzer>
> </fieldType>
>
> Please correct me if I am doing anything wrong here
>
> Regards,
> Poornima
>
>
> On Monday, 14 July 2014 12:33 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
>
>
>
> Are you sure it's not a spelling error or something else weird like
> that? Solr ships with that filter in its example schema:
> <filter class="solr.CJKBigramFilterFactory"/>
>
> So you can compare what you are doing differently against that.
>
> Regards,
> Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
>
> On Mon, Jul 14, 2014 at 1:58 PM, Poornima Jay
> <po...@rocketmail.com> wrote:
>> I have upgraded Solr to version 4.8.1, but after making changes in the schema file I am getting the error below:
>> Error instantiating class: 'org.apache.lucene.analysis.cjk.CJKBigramFilterFactory'
>> I assume CJKBigramFilterFactory and CJKFoldingFilterFactory are supported in 4.8.1. Do I need to make any configuration changes to get this working?
>>
>> Please advise.
>>
>> Regards,
>> Poornima
>>
>>
>> On Thursday, 10 July 2014 2:45 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
>>
>>
>>
>> I would suggest you read through all 12 (?) articles in this series:
>> http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
>> It will probably lay out most of the issues for you.
>>
>> And if you are just starting, I would really suggest using the latest Solr
>> (4.9). A lot more people remember what the latest version has than
>> what was in 3.6. And, as the series above will tell you, some relevant
>> issues have been fixed in more recent Solr versions.
>>
>> Regards,
>> Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>>
>>
>>
>> On Thu, Jul 10, 2014 at 4:11 PM, Poornima Jay
>> <po...@rocketmail.com> wrote:
>>> Until now I thought Solr supported KoreanTokenizer out of the box. I haven't used any third-party one.
>>> The actual issue I am facing is that I need to integrate English, Chinese, Japanese and Korean language search in a single site. The appropriate fields will be queried based on the language the user selects.
>>>
>>> I tried using CJK for all three languages as below, but only a few search terms work for Chinese and Japanese, and nothing works for Korean.
>>>
>>> <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>> <analyzer>
>>> <tokenizer class="solr.CJKTokenizerFactory" />
>>> <filter class="solr.CJKWidthFilterFactory"/>
>>> <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>> <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>> <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>>> <filter class="solr.ICUFoldingFilterFactory"/>
>>> <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>>> </analyzer>
>>> </fieldtype>
>>>
>>> So I tried to implement an individual field type for each language, as below:
>>>
>>> Chinese
>>> <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>>> <analyzer>
>>> <tokenizer class="solr.ICUTokenizerFactory"/>
>>> <filter class="solr.ICUFoldingFilterFactory"/>
>>> <filter class="solr.CJKWidthFilterFactory"/>
>>> <filter class="solr.CJKBigramFilterFactory"/>
>>> </analyzer>
>>> </fieldType>
>>>
>>> Japanese
>>> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>>> <analyzer>
>>> <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>>> <filter class="solr.JapaneseBaseFormFilterFactory"/>
>>> <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="stoptags_ja.txt" />
>>> <filter class="solr.CJKWidthFilterFactory"/>
>>> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" />
>>> <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>>> <filter class="solr.LowerCaseFilterFactory"/>
>>> </analyzer>
>>> </fieldType>
>>>
>>> Korean
>>> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>>> <analyzer type="index">
>>> <tokenizer class="solr.KoreanTokenizerFactory"/>
>>> <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
>>> <filter class="solr.LowerCaseFilterFactory"/>
>>> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>> </analyzer>
>>> <analyzer type="query">
>>> <tokenizer class="solr.KoreanTokenizerFactory"/>
>>> <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
>>> <filter class="solr.LowerCaseFilterFactory"/>
>>> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>> </analyzer>
>>> </fieldType>
>>>
>>> I am really stuck on how to implement this. Please help me.
>>>
>>> Thanks,
>>> Poornima
>>>
>>>
>>>
>>> On Thursday, 10 July 2014 2:22 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
>>>
>>>
>>>
>>> I don't think Solr ships with a Korean tokenizer, does it?
>>>
>>> If you are using a third-party one, you need to give the full class name,
>>> not just solr.Korean... And you need the library added via a lib
>>> statement in solrconfig.xml (at least in Solr 4).
>>>
>>> Regards,
>>> Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>>>
>>>
>>>
>>> On Thu, Jul 10, 2014 at 3:23 PM, Poornima Jay
>>> <po...@rocketmail.com> wrote:
>>>> I have defined the field type inside the fields section. When I checked the error log I found the errors below:
>>>>
>>>> Caused by: java.lang.ClassNotFoundException: solr.KoreanTokenizerFactory
>>>>
>>>> SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list
>>>>
>>>>
>>>> Do I need to add any libraries for KoreanTokenizer?
>>>>
>>>> Regards,
>>>> Poornima
>>>>
>>>>
>>>> On Thursday, 10 July 2014 1:03 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Double-check in your XML file that you don't, for example, define your
>>>> fieldType outside of the fields section. Or maybe you have an exception
>>>> earlier about some component in the type definition.
>>>>
>>>> This is not about the Korean language, it seems, but something more
>>>> fundamental about the XML config.
>>>>
>>>> Regards,
>>>> Alex.
>>>> Personal website: http://www.outerthoughts.com/
>>>> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>>>>
>>>>
>>>>
Re: Korean Tokenizer in solr
Posted by Poornima Jay <po...@rocketmail.com>.
When I try to index, I get the error below:
java.io.FileNotFoundException: /home/searchuser/multicore/apac_content/data/tlog/tlog.0000000000000000000 (No such file or directory)
Re: Korean Tokenizer in solr
Posted by Poornima Jay <po...@rocketmail.com>.
Yes. Below is the field type I defined:
<fieldType name="text_match_phrase_cjk" class="solr.TextField" positionIncrementGap="100">
<analyzer type ="index">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
</analyzer>
<analyzer type ="query">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
</analyzer>
</fieldType>
Please correct me if I am doing anything wrong here
Regards,
Poornima
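A possible cause, not confirmed anywhere in this thread: `indexUnigrams` is not among the attributes that `CJKBigramFilterFactory` documents (`han`, `hiragana`, `katakana`, `hangul`, `outputUnigrams`), and recent 4.x analysis factories reject parameters they do not recognize, which would surface as an instantiation error. The sketch below is the same field type with only that attribute renamed to `outputUnigrams`. It also assumes the ICU analysis jars from contrib/analysis-extras are on the classpath, since `ICUTokenizerFactory` is not part of the core analyzers.

```xml
<!-- Same definition as above; only indexUnigrams is replaced by outputUnigrams (assumed fix) -->
<fieldType name="text_match_phrase_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- outputUnigrams="true" emits single characters in addition to bigrams -->
    <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true" han="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true" han="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

The full stack trace in the Solr log should name the rejected parameter if this is indeed the problem, so it is worth checking there before changing anything else.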
Re: Korean Tokenizer in solr
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Are you sure it's not a spelling error or something else weird like
that? Solr ships with that filter in its example schema:
<filter class="solr.CJKBigramFilterFactory"/>
So you can compare what you are doing differently against that.
Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
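For comparison, the stock `text_cjk` type in the Solr 4.x example schema looks roughly like the following. This is reproduced from memory, so treat the attribute values as approximate and check the schema.xml shipped with your release.

```xml
<!-- Approximation of text_cjk from the Solr 4.x example schema -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Normalize full-width ASCII and half-width Katakana variants -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Form bigrams of adjacent CJK characters; no extra attributes needed -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

Every factory in this chain ships with the core analyzers, so it needs no extra jars and makes a convenient baseline to diff a custom definition against.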
>> On Thu, Jul 10, 2014 at 3:23 PM, Poornima Jay
>> <po...@rocketmail.com> wrote:
>>> I have defined the fieldtype inside the fields section. When i checked the error log i found the below error
>>>
>>> Caused by: java.lang.ClassNotFoundException: solr.KoreanTokenizerFactory
>>>
>>> SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list
>>>
>>>
>>> Do i need to add any libraries for koreanTokenizer?
>>>
>>> Regards,
>>> Poornima
>>>
>>>
>>> On Thursday, 10 July 2014 1:03 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
>>>
>>>
>>>
>>> Double check your xml file that you don't - for example - define your
>>> fieldType outside of fields section. Or maybe you have exception
>>> earlier about some component in the type definition.
>>>
>>> This is not about Korean language, it seems. Something more
>>> fundamentally about XML config.
>>>
>>> Regards,
>>> Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>>>
>>>
>>>
>>> On Thu, Jul 10, 2014 at 2:26 PM, Poornima Jay
>>> <po...@rocketmail.com> wrote:
>>>> Hi,
>>>>
>>>> Anyone tried to implement korean language in solr 3.6.1. I define the field
>>>> as below in my schema file but the fieldtype is not working.
>>>>
>>>> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000"
>>>>>
>>>> <analyzer type="index">
>>>> <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>> <filter class="solr.KoreanFilterFactory" hasOrigin="true"
>>>> hasCNoun="true" bigrammable="true"/>
>>>> <filter class="solr.LowerCaseFilterFactory"/>
>>>> <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords_kr.txt"/>
>>>> </analyzer>
>>>> <analyzer type="query">
>>>> <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>> <filter class="solr.KoreanFilterFactory" hasOrigin="false"
>>>> hasCNoun="false" bigrammable="false"/>
>>>> <filter class="solr.LowerCaseFilterFactory"/>
>>>> <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords_kr.txt"/>
>>>> </analyzer>
>>>> </fieldType>
>>>>
>>>> Error : Caused by: org.apache.solr.common.SolrException: Unknown fieldtype
>>>> 'text_kr' specified on field product_name_kr
>>>>
>>>> Regards,
>>>> Poornima
>>>>
Re: Korean Tokenizer in solr
Posted by Poornima Jay <po...@rocketmail.com>.
I have upgraded Solr to version 4.8.1, but after making changes to the schema file I am getting the error below:
Error instantiating class: 'org.apache.lucene.analysis.cjk.CJKBigramFilterFactory'
I assumed CJKBigramFilterFactory and CJKFoldingFilterFactory are supported in 4.8.1. Do I need to make any configuration changes to get this working?
Please advise.
Regards,
Poornima
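One way to narrow down an "Error instantiating class" like the one above, offered as a sketch rather than a diagnosis: load a field type that uses only factories which, as far as I know, ship in the core analyzers jar in Solr 4.x. The chain below mirrors what Lucene's CJKAnalyzer does; the name text_cjk_core is made up for illustration. If it loads cleanly, the other factories in the original chain are the more likely suspects: the ICU factories need the analysis-extras contrib jars on the classpath, and edu.stanford.lucene.analysis.CJKFoldingFilterFactory is a third-party class, not part of Solr.

```xml
<!-- Assumed name text_cjk_core; uses only core analyzer factories,
     no ICU or third-party classes. -->
<fieldType name="text_cjk_core" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Normalizes full-width/half-width character variants -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Forms bigrams from Han, Hiragana, Katakana and Hangul tokens -->
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

Because hangul="true" is set, this chain also gives basic bigram matching for Korean text without any Korean-specific tokenizer.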
On Thursday, 10 July 2014 2:45 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
I would suggest you read through all 12 (?) articles in this series:
http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
. It will probably lay out most of the issues for you.
And if you are just starting, I would really suggest using the latest Solr
(4.9). A lot more people remember what the latest version has than
what was in 3.6. And, as the series above will tell you, some relevant
issues have been fixed in more recent Solr versions.
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
Re: Korean Tokenizer in solr
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I would suggest you read through all 12 (?) articles in this series:
http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
It will probably lay out most of the issues for you.
And if you are just starting, I would really suggest using the latest Solr
(4.9). A lot more people remember what the latest version has than
what was in 3.6. And, as the series above will tell you, some relevant
issues have been fixed in more recent Solr versions.
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
On Thu, Jul 10, 2014 at 4:11 PM, Poornima Jay
<po...@rocketmail.com> wrote:
> Until now I thought Solr supported KoreanTokenizer. I haven't used any third-party one.
> Actually, the issue I am facing is that I need to integrate English, Chinese, Japanese and Korean language search in a single site. The appropriate fields will be queried based on the language the user selects for the search.
>
> I tried using a single CJK field type for all three languages, as below, but only a few search terms work for Chinese and Japanese, and nothing works for Korean.
>
> <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
> <analyzer>
> <tokenizer class="solr.CJKTokenizerFactory" />
> <filter class="solr.CJKWidthFilterFactory"/>
> <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
> <filter class="solr.ICUFoldingFilterFactory"/>
> <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
> </analyzer>
> </fieldtype>
>
> So I tried to implement an individual field type for each language, as below:
>
> Chinese
> <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
> <analyzer>
> <tokenizer class="solr.ICUTokenizerFactory"/>
> <filter class="solr.ICUFoldingFilterFactory"/>
> <filter class="solr.CJKWidthFilterFactory"/>
> <filter class="solr.CJKBigramFilterFactory"/>
> </analyzer>
> </fieldType>
>
> Japanese
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
> <analyzer>
> <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
> <filter class="solr.JapaneseBaseFormFilterFactory"/>
> <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="stoptags_ja.txt" />
> <filter class="solr.CJKWidthFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" />
> <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> </fieldType>
>
> Korean
> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
> <analyzer type="index">
> <tokenizer class="solr.KoreanTokenizerFactory"/>
> <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.KoreanTokenizerFactory"/>
> <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
> </analyzer>
> </fieldType>
>
> I am really stuck on how to implement this. Please help me.
>
> Thanks,
> Poornima
Re: Korean Tokenizer in solr
Posted by Poornima Jay <po...@rocketmail.com>.
Until now I thought Solr supported KoreanTokenizer. I haven't used any third-party one.
Actually, the issue I am facing is that I need to integrate English, Chinese, Japanese and Korean language search in a single site. The appropriate fields will be queried based on the language the user selects for the search.
I tried using a single CJK field type for all three languages, as below, but only a few search terms work for Chinese and Japanese, and nothing works for Korean.
<fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
<analyzer>
<tokenizer class="solr.CJKTokenizerFactory" />
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
<filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
<filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
</analyzer>
</fieldtype>
So I tried to implement an individual field type for each language, as below:
Chinese
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
</analyzer>
</fieldType>
Japanese
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer>
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
<filter class="solr.JapaneseBaseFormFilterFactory"/>
<filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="stoptags_ja.txt" />
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" />
<filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Korean
<fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
<analyzer type="index">
<tokenizer class="solr.KoreanTokenizerFactory"/>
<filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KoreanTokenizerFactory"/>
<filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
</analyzer>
</fieldType>
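For the per-language querying described above, the field declarations might look roughly like this. It is only a sketch: product_name_kr is the field named in the original error message, while the other field names and the use of the stock text_general type for English are assumptions made by analogy.

```xml
<!-- schema.xml: one field per language, each bound to its own field type.
     Only product_name_kr comes from the thread; the rest are assumed names. -->
<field name="product_name_en" type="text_general" indexed="true" stored="true"/>
<field name="product_name_zh" type="text_cjk" indexed="true" stored="true"/>
<field name="product_name_ja" type="text_ja" indexed="true" stored="true"/>
<field name="product_name_kr" type="text_kr" indexed="true" stored="true"/>
```

A search in Japanese would then be sent against product_name_ja only, a search in Korean against product_name_kr, and so on.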
I am really stuck on how to implement this. Please help me.
Thanks,
Poornima
On Thursday, 10 July 2014 2:22 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
I don't think Solr ships with a Korean tokenizer, does it?
If you are using a third-party one, you need to give the full class name,
not just solr.Korean... And you need the library added in the lib
statement in solrconfig.xml (at least in Solr 4).
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
Re: Korean Tokenizer in solr
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I don't think Solr ships with a Korean tokenizer, does it?
If you are using a third-party one, you need to give the full class name,
not just solr.Korean... And you need the library added in the lib
statement in solrconfig.xml (at least in Solr 4).
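Roughly, and only as a sketch, the two pieces would look like the fragments below. The jar directory and the package name com.example.ko are placeholders, not a real library; substitute whatever the third-party analyzer you install actually provides.

```xml
<!-- solrconfig.xml: put the third-party analyzer jar on the classpath.
     The directory is a placeholder; point it at wherever the jar lives. -->
<lib dir="/path/to/korean-analyzer/lib" regex=".*\.jar" />

<!-- schema.xml: refer to the factory by its full class name.
     com.example.ko.KoreanTokenizerFactory is a made-up name. -->
<fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="com.example.ko.KoreanTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The solr. prefix is only shorthand for Solr's and Lucene's own analysis packages, which is why a class from another library has to be spelled out in full.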
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
On Thu, Jul 10, 2014 at 3:23 PM, Poornima Jay
<po...@rocketmail.com> wrote:
> I have defined the fieldtype inside the fields section. When I checked the error log I found the errors below:
>
> Caused by: java.lang.ClassNotFoundException: solr.KoreanTokenizerFactory
>
> SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list
>
>
> Do I need to add any libraries for KoreanTokenizer?
>
> Regards,
> Poornima
Re: Korean Tokenizer in solr
Posted by Poornima Jay <po...@rocketmail.com>.
I have defined the fieldtype inside the fields section. When I checked the error log I found the errors below:
Caused by: java.lang.ClassNotFoundException: solr.KoreanTokenizerFactory
SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list
Do I need to add any libraries for KoreanTokenizer?
Regards,
Poornima
On Thursday, 10 July 2014 1:03 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
Double-check your XML file to make sure you don't, for example, define your
fieldType outside of the fields section. Or maybe you have an exception
earlier about some component in the type definition.
This does not seem to be about the Korean language. It looks like something
more fundamental about the XML config.
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
Re: Korean Tokenizer in solr
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Double-check your XML file to make sure you don't, for example, define your
fieldType outside of the fields section. Or maybe you have an exception
earlier about some component in the type definition.
This does not seem to be about the Korean language. It looks like something
more fundamental about the XML config.
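For reference, here is a stripped-down skeleton of how the stock 3.x example schema.xml is laid out, as far as I remember it: type definitions sit under a types element and field definitions under a fields element. The StandardTokenizerFactory analyzer and the version attribute are just placeholders to keep the sketch loadable. A fieldType that lands in the wrong section, or that fails to load because one of its analyzer classes cannot be found, can surface later as an "Unknown fieldtype" error on the field that refers to it.

```xml
<!-- schema.xml skeleton; analyzer contents are a placeholder -->
<schema name="example" version="1.5">
  <types>
    <!-- fieldType declarations go here -->
    <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <!-- fields refer to a type by its name -->
    <field name="product_name_kr" type="text_kr" indexed="true" stored="true"/>
  </fields>
</schema>
```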
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
On Thu, Jul 10, 2014 at 2:26 PM, Poornima Jay
<po...@rocketmail.com> wrote:
> Hi,
>
> Has anyone tried to implement Korean language support in Solr 3.6.1? I defined the field type
> as below in my schema file, but the fieldtype is not working.
>
> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000">
> <analyzer type="index">
> <tokenizer class="solr.KoreanTokenizerFactory"/>
> <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.KoreanTokenizerFactory"/>
> <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
> </analyzer>
> </fieldType>
>
> Error : Caused by: org.apache.solr.common.SolrException: Unknown fieldtype 'text_kr' specified on field product_name_kr
>
> Regards,
> Poornima
>