
Posted to solr-user@lucene.apache.org by Poornima Jay <po...@rocketmail.com> on 2014/07/10 09:26:45 UTC

Korean Tokenizer in solr

Hi,

Has anyone tried to implement Korean language support in Solr 3.6.1? I defined the field type as below in my schema file, but the fieldType is not working.

<fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000" >
      <analyzer type="index">
        <tokenizer class="solr.KoreanTokenizerFactory"/>
        <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true"  bigrammable="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KoreanTokenizerFactory"/>
        <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false"  bigrammable="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
      </analyzer>      
    </fieldType>
    
Error : Caused by: org.apache.solr.common.SolrException: Unknown fieldtype 'text_kr' specified on field product_name_kr

Regards,
Poornima

Re: Korean Tokenizer in solr

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

What happens if you have a new collection with absolute minimum in it
and then add the definition? Start from something like:
https://github.com/arafalov/simplest-solr-config .

Also, is there a long exception earlier in the log? It may have more clues.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
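
To make the "absolute minimum" test concrete, a stripped-down Solr 4.x schema.xml might look like the sketch below. The schema name, the field names (`id`, `body_test`), and the `text_test` type are illustrative assumptions and are not taken from the linked repository. The idea is to confirm that this baseline loads, then swap in the analyzer being debugged one component at a time until the error reappears.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical minimal schema.xml for Solr 4.x; every name here is a placeholder. -->
<schema name="minimal" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <!-- Field that exercises the type under test -->
    <field name="body_test" type="text_test" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <!-- Baseline chain built only from core classes; replace it step by step
         with the tokenizer and filters of the fieldType that fails to load. -->
    <fieldType name="text_test" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
```

If the update log is enabled in solrconfig.xml, the schema also needs a `_version_` field, so a truly minimal setup usually leaves the update log out.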


On Mon, Jul 14, 2014 at 2:15 PM, Poornima Jay
<po...@rocketmail.com> wrote:
> Yes, below is my defined fieldType:
>
> <fieldType name="text_match_phrase_cjk" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type ="index">
>          <tokenizer class="solr.ICUTokenizerFactory"/>
>          <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
>          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
>       </analyzer>
>       <analyzer type ="query">
>          <tokenizer class="solr.ICUTokenizerFactory"/>
>          <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
>          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
>       </analyzer>
>    </fieldType>
>
> Please correct me if I am doing anything wrong here
>
> Regards,
> Poornima
>
>
> On Monday, 14 July 2014 12:33 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
>
>
>
> Are you sure it's not a spelling error or something else weird like
> that? Because Solr ships with that filter in its example schema:
>         <filter class="solr.CJKBigramFilterFactory"/>
>
> So, you can compare what you are doing differently with that.
>
> Regards,
>    Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
>
> On Mon, Jul 14, 2014 at 1:58 PM, Poornima Jay
> <po...@rocketmail.com> wrote:
>> I have upgraded Solr to version 4.8.1, but after making changes in the schema file I am getting the error below:
>> Error instantiating class: 'org.apache.lucene.analysis.cjk.CJKBigramFilterFactory'
>> I assume CJKBigramFilterFactory and CJKFoldingFilterFactory are supported in 4.8.1. Do I need to make any configuration changes to get this working?
>>
>> Please advise.
>>
>> Regards,
>> Poornima
>>
>>
>> On Thursday, 10 July 2014 2:45 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
>>
>>
>>
>> I would suggest you read through all 12 (?) articles in this series:
>> http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
>> . It will probably lay out most of the issues for you.
>>
>> And if you are just starting, I would really suggest using the latest Solr
>> (4.9). A lot more people remember what the latest version has than
>> what was in 3.6. And, as the series above will tell you, some relevant
>> issues have been fixed in more recent Solr versions.
>>
>> Regards,
>>    Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>>
>>
>>
>> On Thu, Jul 10, 2014 at 4:11 PM, Poornima Jay
>> <po...@rocketmail.com> wrote:
>>> Until now I thought Solr supported KoreanTokenizer. I haven't used any third-party one.
>>> The actual issue I am facing is that I need to integrate English, Chinese, Japanese, and Korean language search in a single site. The appropriate fields will be queried based on the language the user selects.
>>>
>>> I tried using CJK for all three languages, as below, but only a few search terms work for Chinese and Japanese, and nothing works for Korean.
>>>
>>> <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>>      <analyzer>
>>>         <tokenizer class="solr.CJKTokenizerFactory" />
>>>         <filter class="solr.CJKWidthFilterFactory"/>
>>>         <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>>         <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>         <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>>>         <filter class="solr.ICUFoldingFilterFactory"/>
>>>         <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>>>       </analyzer>
>>>     </fieldtype>
>>>
>>> So I tried to implement an individual fieldType for each language, as below:
>>>
>>> Chinese
>>>  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>>>      <analyzer>
>>>          <tokenizer class="solr.ICUTokenizerFactory"/>
>>>            <filter class="solr.ICUFoldingFilterFactory"/>
>>>            <filter class="solr.CJKWidthFilterFactory"/>
>>>            <filter class="solr.CJKBigramFilterFactory"/>
>>>        </analyzer>
>>>     </fieldType>
>>>
>>> Japanese
>>> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>>>    <analyzer>
>>>      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>>>       <filter class="solr.JapaneseBaseFormFilterFactory"/>
>>>       <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="stoptags_ja.txt" />
>>>       <filter class="solr.CJKWidthFilterFactory"/>
>>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" />
>>>       <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>    </analyzer>
>>> </fieldType>
>>>
>>> Korean
>>> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>>>       <analyzer type="index">
>>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>         <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true"  bigrammable="true"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>>       </analyzer>
>>>       <analyzer type="query">
>>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>         <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false"  bigrammable="false"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>>       </analyzer>
>>>     </fieldType>
>>>
>>> I am really stuck on how to implement this. Please help me.
>>>
>>> Thanks,
>>> Poornima
>>>
>>>
>>>
>>> On Thursday, 10 July 2014 2:22 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
>>>
>>>
>>>
>>> I don't think Solr ships with Korean Tokenizer, does it?
>>>
>>> If you are using a 3rd party one, you need to give full class name,
>>> not just solr.Korean... And you need the library added in the lib
>>> statement in solrconfig.xml (at least in Solr 4).
>>>
>>> Regards,
>>>    Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>>>
>>>
>>>
>>> On Thu, Jul 10, 2014 at 3:23 PM, Poornima Jay
>>> <po...@rocketmail.com> wrote:
>>>> I have defined the fieldType inside the fields section. When I checked the error log, I found the error below:
>>>>
>>>> Caused by: java.lang.ClassNotFoundException: solr.KoreanTokenizerFactory
>>>>
>>>> SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list
>>>>
>>>>
>>>> Do I need to add any libraries for KoreanTokenizer?
>>>>
>>>> Regards,
>>>> Poornima
>>>>
>>>>
>>>> On Thursday, 10 July 2014 1:03 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Double-check your XML file to make sure you don't, for example, define your
>>>> fieldType outside of the fields section. Or maybe you have an exception
>>>> earlier about some component in the type definition.
>>>>
>>>> This is not about the Korean language, it seems. It is something more
>>>> fundamental about the XML config.
>>>>
>>>> Regards,
>>>>    Alex.
>>>> Personal website: http://www.outerthoughts.com/
>>>> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>>>>
>>>>
>>>>
>>>> On Thu, Jul 10, 2014 at 2:26 PM, Poornima Jay
>>>> <po...@rocketmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> Has anyone tried to implement Korean language support in Solr 3.6.1? I defined the field
>>>>> type as below in my schema file, but the fieldType is not working.
>>>>>
>>>>> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000">
>>>>>       <analyzer type="index">
>>>>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>>>         <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true"  bigrammable="true"/>
>>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>>>>       </analyzer>
>>>>>       <analyzer type="query">
>>>>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>>>         <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false"  bigrammable="false"/>
>>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>>>>       </analyzer>
>>>>>     </fieldType>
>>>>>
>>>>> Error : Caused by: org.apache.solr.common.SolrException: Unknown fieldtype 'text_kr' specified on field product_name_kr
>>>>>
>>>>> Regards,
>>>>> Poornima
>>>>>
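
The advice quoted above (give the full class name for a third-party analyzer and add the library through a lib statement in solrconfig.xml) could be sketched as follows for Solr 4.x. The relative paths assume the directory layout of the stock example core, and the `com.example.analysis.ko` package and the `./lib` directory are purely hypothetical placeholders for whatever a third-party Korean analyzer actually provides. The `contrib/analysis-extras` directories are where the Solr 4 distribution keeps the ICU jars that solr.ICUTokenizerFactory and the ICU filters depend on.

```xml
<!-- solrconfig.xml: paths are relative to the core's instance directory
     (assumed here to match the stock example layout). -->
<lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar"/>
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*\.jar"/>
<!-- Hypothetical directory holding a third-party Korean analyzer jar -->
<lib dir="./lib" regex=".*\.jar"/>

<!-- schema.xml: the "solr." shorthand is resolved against Solr's and Lucene's
     own packages, so a third-party factory needs its fully qualified name.
     The class name below is a placeholder, not a real library. -->
<tokenizer class="com.example.analysis.ko.KoreanTokenizerFactory"/>
```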

Re: Korean Tokenizer in solr

Posted by Poornima Jay <po...@rocketmail.com>.

When I try to index, the error below appears:

java.io.FileNotFoundException: /home/searchuser/multicore/apac_content/data/tlog/tlog.0000000000000000000 (No such file or directory)






Re: Korean Tokenizer in solr

Posted by Poornima Jay <po...@rocketmail.com>.

Yes, below is my defined fieldType:

<fieldType name="text_match_phrase_cjk" class="solr.TextField" positionIncrementGap="100">
      <analyzer type ="index">
         <tokenizer class="solr.ICUTokenizerFactory"/>
         <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
      </analyzer>
      <analyzer type ="query">
         <tokenizer class="solr.ICUTokenizerFactory"/>
         <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
      </analyzer>
   </fieldType>

Please correct me if I am doing anything wrong here

Regards,
Poornima
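
One detail worth checking in the definition above: as far as I know, solr.CJKBigramFilterFactory accepts the script flags `han`, `hiragana`, `katakana`, and `hangul` plus the option `outputUnigrams`, and `indexUnigrams` is not among them. Recent Solr 4.x releases reject unrecognized factory parameters when the schema loads, which would be consistent with the "Error instantiating class" message, although that is an inference and not something confirmed in this thread. A sketch of the same type with only that parameter renamed:

```xml
<fieldType name="text_match_phrase_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Assumption: outputUnigrams is what was intended by indexUnigrams -->
    <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true" han="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true" han="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

Note that solr.ICUTokenizerFactory is not part of Solr core, so the ICU jars from the analysis-extras contrib also have to be on the classpath for this type to load.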



Re: Korean Tokenizer in solr

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Are you sure it's not a spelling error or something else weird like
that? Because Solr ships with that filter in its example schema:
        <filter class="solr.CJKBigramFilterFactory"/>

So, you can compare what you are doing differently with that.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
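
For comparison, the `text_cjk` type in the example schema bundled with Solr 4.x looks roughly like the following. It is reproduced from memory, so treat it as an approximation and check it against the schema.xml in your own distribution.

```xml
<!-- Approximate copy of text_cjk from the Solr 4.x example schema -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Normalize full-width and half-width forms before forming bigrams -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Lowercasing only affects non-CJK text -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- No attributes: all scripts are bigrammed and unigrams are not emitted -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

Every class in this chain comes from the core analysis jars, so it should load without any extra lib statements, which makes it a useful baseline when a customized variant fails.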


On Mon, Jul 14, 2014 at 1:58 PM, Poornima Jay
<po...@rocketmail.com> wrote:
> I have upgraded Solr to version 4.8.1, but after making changes in the schema file I am getting the error below:
> Error instantiating class: 'org.apache.lucene.analysis.cjk.CJKBigramFilterFactory'
> I assume CJKBigramFilterFactory and CJKFoldingFilterFactory are supported in 4.8.1. Do I need to make any configuration changes to get this working?
>
> Please advise.
>
> Regards,
> Poornima
>
>
> On Thursday, 10 July 2014 2:45 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
>
>
>
> I would suggest you read through all 12 (?) articles in this series:
> http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
> . It will probably lay out most of the issues for you.
>
> And if you are just starting, I would really suggest using the latest Solr
> (4.9). A lot more people remember what the latest version has than
> what was in 3.6. And, as the series above will tell you, some relevant
> issues have been fixed in more recent Solr versions.
>
> Regards,
>    Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>
>
>
> On Thu, Jul 10, 2014 at 4:11 PM, Poornima Jay
> <po...@rocketmail.com> wrote:
>> Until now I thought Solr supported KoreanTokenizer. I haven't used any third-party one.
>> The actual issue I am facing is that I need to integrate English, Chinese, Japanese, and Korean language search in a single site. The appropriate fields will be queried based on the language the user selects.
>>
>> I tried using CJK for all three languages, as below, but only a few search terms work for Chinese and Japanese, and nothing works for Korean.
>>
>> <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>      <analyzer>
>>         <tokenizer class="solr.CJKTokenizerFactory" />
>>         <filter class="solr.CJKWidthFilterFactory"/>
>>         <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>         <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>         <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>>         <filter class="solr.ICUFoldingFilterFactory"/>
>>         <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>>       </analyzer>
>>     </fieldtype>
>>
>> So i tried to implement individual fieldtype for each language as below
>>
>> Chinese
>>  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>>      <analyzer>
>>          <tokenizer class="solr.ICUTokenizerFactory"/>
>>            <filter class="solr.ICUFoldingFilterFactory"/>
>>            <filter class="solr.CJKWidthFilterFactory"/>
>>            <filter class="solr.CJKBigramFilterFactory"/>
>>        </analyzer>
>>     </fieldType>
>>
>> Japanese
>> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>>    <analyzer>
>>      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>>       <filter class="solr.JapaneseBaseFormFilterFactory"/>
>>       <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="stoptags_ja.txt" />
>>       <filter class="solr.CJKWidthFilterFactory"/>
>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" />
>>       <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>    </analyzer>
>> </fieldType>
>>
>> Korean
>> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>>       <analyzer type="index">
>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>         <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true"  bigrammable="true"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>       </analyzer>
>>       <analyzer type="query">
>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>         <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false"  bigrammable="false"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>       </analyzer>
>>     </fieldType>
>>
>> I am really struck how to implement this. Please help me.
>>
>> Thanks,
>> Poornima
>>
>>
>>
>> On Thursday, 10 July 2014 2:22 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
>>
>>
>>
>> I don't think Solr ships with Korean Tokenizer, does it?
>>
>> If you are using a 3rd party one, you need to give full class name,
>> not just solr.Korean... And you need the library added in the lib
>> statement in solrconfig.xml (at least in Solr 4).
>>
>> Regards,
>>    Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>>
>>
>>
>> On Thu, Jul 10, 2014 at 3:23 PM, Poornima Jay
>> <po...@rocketmail.com> wrote:
>>> I have defined the fieldtype inside the fields section.  When i checked the error log i found the below error
>>>
>>> Caused by: java.lang.ClassNotFoundException: solr.KoreanTokenizerFactory
>>>
>>> SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list
>>>
>>>
>>> Do i need to add any libraries for koreanTokenizer?
>>>
>>> Regards,
>>> Poornima
>>>
>>>
>>> On Thursday, 10 July 2014 1:03 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
>>>
>>>
>>>
>>> Double check your xml file that you don't - for example - define your
>>> fieldType outside of fields section. Or maybe you have exception
>>> earlier about some component in the type definition.
>>>
>>> This is not about Korean language, it seems. Something more
>>> fundamentally about XML config.
>>>
>>> Regards,
>>>    Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>>>
>>>
>>>
>>> On Thu, Jul 10, 2014 at 2:26 PM, Poornima Jay
>>> <po...@rocketmail.com> wrote:
>>>> Hi,
>>>>
>>>> Anyone tried to implement korean language in solr 3.6.1. I define the field
>>>> as below in my schema file but the fieldtype is not working.
>>>>
>>>> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000"
>>>>>
>>>>       <analyzer type="index">
>>>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>>         <filter class="solr.KoreanFilterFactory" hasOrigin="true"
>>>> hasCNoun="true"  bigrammable="true"/>
>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords_kr.txt"/>
>>>>       </analyzer>
>>>>       <analyzer type="query">
>>>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>>         <filter class="solr.KoreanFilterFactory" hasOrigin="false"
>>>> hasCNoun="false"  bigrammable="false"/>
>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords_kr.txt"/>
>>>>       </analyzer>
>>>>     </fieldType>
>>>>
>>>> Error : Caused by: org.apache.solr.common.SolrException: Unknown fieldtype
>>>> 'text_kr' specified on field product_name_kr
>>>>
>>>> Regards,
>>>> Poornima
>>>>

Re: Korean Tokenizer in solr

Posted by Poornima Jay <po...@rocketmail.com>.

I have upgraded Solr to version 4.8.1, but after changing the schema file I get the error below:
Error instantiating class: 'org.apache.lucene.analysis.cjk.CJKBigramFilterFactory'
I assume CJKBigramFilterFactory and CJKFoldingFilterFactory are supported in 4.8.1. Do I need to make any configuration changes to get this working?

Please advise.

Regards,
Poornima


On Thursday, 10 July 2014 2:45 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
 


I would suggest you read through all 12 (?) articles in this series:
http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
. It will probably lay out most of the issues for you.

And if you are just starting, I would really suggest using the latest Solr
(4.9). A lot more people remember what the latest version has than
what was in 3.6. And, as the series above will tell you, some relevant
issues have been fixed in more recent Solr versions.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency




Re: Korean Tokenizer in solr

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

I would suggest you read through all 12 (?) articles in this series:
http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
. It will probably lay out most of the issues for you.

And if you are just starting, I would really suggest using the latest Solr
(4.9). A lot more people remember what the latest version has than
what was in 3.6. And, as the series above will tell you, some relevant
issues have been fixed in more recent Solr versions.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Thu, Jul 10, 2014 at 4:11 PM, Poornima Jay
<po...@rocketmail.com> wrote:
> Until now I thought Solr supported KoreanTokenizer out of the box. I haven't used any third-party one.
> The actual issue I am facing is that I need to integrate English, Chinese, Japanese and Korean search in a single site. The fields will be queried according to the language the user selects.
>
> I tried using CJK for all three languages as below, but only a few search terms work for Chinese and Japanese, and nothing works for Korean.
>
> <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>      <analyzer>
>         <tokenizer class="solr.CJKTokenizerFactory" />
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>         <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>         <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>         <filter class="solr.ICUFoldingFilterFactory"/>
>         <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>       </analyzer>
>     </fieldtype>
>
> So I tried to implement an individual fieldType for each language, as below:
>
> Chinese
>  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>      <analyzer>
>          <tokenizer class="solr.ICUTokenizerFactory"/>
>            <filter class="solr.ICUFoldingFilterFactory"/>
>            <filter class="solr.CJKWidthFilterFactory"/>
>            <filter class="solr.CJKBigramFilterFactory"/>
>        </analyzer>
>     </fieldType>
>
> Japanese
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>    <analyzer>
>      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>       <filter class="solr.JapaneseBaseFormFilterFactory"/>
>       <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="stoptags_ja.txt" />
>       <filter class="solr.CJKWidthFilterFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" />
>       <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>    </analyzer>
> </fieldType>
>
> Korean
> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>       <analyzer type="index">
>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>         <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true"  bigrammable="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>         <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false"  bigrammable="false"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>       </analyzer>
>     </fieldType>
>
> I am really stuck on how to implement this. Please help me.
>
> Thanks,
> Poornima

Re: Korean Tokenizer in solr

Posted by Poornima Jay <po...@rocketmail.com>.

Until now I thought Solr supported KoreanTokenizer out of the box. I haven't used any third-party one.
The actual issue I am facing is that I need to integrate English, Chinese, Japanese and Korean search in a single site. The fields will be queried according to the language the user selects.

I tried using CJK for all three languages as below, but only a few search terms work for Chinese and Japanese, and nothing works for Korean.

<fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
     <analyzer>        
        <tokenizer class="solr.CJKTokenizerFactory" />
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
      </analyzer>
    </fieldtype>

So I tried to implement an individual fieldType for each language, as below:

Chinese
 <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
     <analyzer>
         <tokenizer class="solr.ICUTokenizerFactory"/>
           <filter class="solr.ICUFoldingFilterFactory"/>
           <filter class="solr.CJKWidthFilterFactory"/>
           <filter class="solr.CJKBigramFilterFactory"/>
       </analyzer>
    </fieldType>

Japanese
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
   <analyzer>
     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
      <filter class="solr.JapaneseBaseFormFilterFactory"/>
      <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="stoptags_ja.txt" />
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" />
      <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
      <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldType>

Korean
<fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
      <analyzer type="index">
        <tokenizer class="solr.KoreanTokenizerFactory"/>
        <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true"  bigrammable="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KoreanTokenizerFactory"/>
        <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false"  bigrammable="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
      </analyzer>      
    </fieldType>

I am really stuck on how to implement this. Please help me.

Thanks,
Poornima
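
To illustrate the per-language setup described above, one possible layout is a separate field per language, each bound to its own field type, with the application choosing which field to query based on the user's selected language. This is only a sketch: product_name_kr is the only field name that appears in this thread, the other field names are hypothetical, and text_general is assumed to be the stock English-friendly type from the example schema:

```xml
<!-- schema.xml fragment: one field per language, each with its own analysis chain.
     Only product_name_kr comes from the thread; the other names are placeholders. -->
<field name="product_name_en" type="text_general" indexed="true" stored="true"/>
<field name="product_name_zh" type="text_cjk"     indexed="true" stored="true"/>
<field name="product_name_ja" type="text_ja"      indexed="true" stored="true"/>
<field name="product_name_kr" type="text_kr"      indexed="true" stored="true"/>
```

Each type referenced here still has to load successfully on its own, so a missing tokenizer class in one of them will prevent the fields that use it from being created.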



On Thursday, 10 July 2014 2:22 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
 


I don't think Solr ships with a Korean tokenizer, does it?

If you are using a third-party one, you need to give the full class name,
not just solr.Korean... You also need the library added via a lib
statement in solrconfig.xml (at least in Solr 4).

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency




Re: Korean Tokenizer in solr

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

I don't think Solr ships with a Korean tokenizer, does it?

If you are using a third-party one, you need to give the full class name,
not just solr.Korean... You also need the library added via a lib
statement in solrconfig.xml (at least in Solr 4).
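
As a rough sketch of what that could look like: the jar directory and the com.example package below are placeholders, not the names of any real Korean analyzer, so substitute whatever the third-party library actually provides.

```xml
<!-- solrconfig.xml fragment: load third-party analyzer jars.
     The directory is a placeholder; point it at wherever the jar lives. -->
<lib dir="../../contrib/korean/lib" regex=".*\.jar" />

<!-- schema.xml fragment: reference the factory by its fully qualified class name.
     com.example.analysis.ko.KoreanTokenizerFactory is hypothetical. -->
<fieldType name="text_kr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.example.analysis.ko.KoreanTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The solr. prefix is shorthand that only resolves for classes in Solr's and Lucene's own packages, which is why a third-party factory has to be spelled out in full.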

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Thu, Jul 10, 2014 at 3:23 PM, Poornima Jay
<po...@rocketmail.com> wrote:
> I have defined the fieldType inside the fields section. When I checked the error log, I found the errors below:
>
> Caused by: java.lang.ClassNotFoundException: solr.KoreanTokenizerFactory
>
> SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list
>
>
> Do I need to add any libraries for KoreanTokenizer?
>
> Regards,
> Poornima

Re: Korean Tokenizer in solr

Posted by Poornima Jay <po...@rocketmail.com>.

I have defined the fieldType inside the fields section. When I checked the error log, I found the errors below:

Caused by: java.lang.ClassNotFoundException: solr.KoreanTokenizerFactory

SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list

Do I need to add any libraries for KoreanTokenizer?

Regards,
Poornima

On Thursday, 10 July 2014 1:03 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:

Double-check your XML file to make sure you don't, for example, define your
fieldType outside of the fields section. Or maybe there is an exception
earlier in the log about some component in the type definition.

This does not seem to be about the Korean language. It is something more
fundamental about the XML config.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


Re: Korean Tokenizer in solr

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Double-check your XML file to make sure you don't, for example, define your
fieldType outside of the fields section. Or maybe there is an exception
earlier in the log about some component in the type definition.

This does not seem to be about the Korean language. It is something more
fundamental about the XML config.
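
As a rough illustration of the layout in question: a Solr 3.x/4.x schema.xml conventionally declares field types inside a types element and fields inside a fields element. The sketch below assumes that conventional layout. The id field, the string type, and StandardTokenizerFactory as a stand-in tokenizer are placeholders and not part of the original configuration:

```xml
<schema name="example" version="1.5">
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <!-- If any class named inside this type fails to load, the whole type is
         skipped, and fields that use it then report "Unknown fieldtype". -->
    <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="product_name_kr" type="text_kr" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
```

Starting from a minimal type like this and adding the tokenizer and filters back one at a time is one way to find which component triggers the earlier exception.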

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Thu, Jul 10, 2014 at 2:26 PM, Poornima Jay
<po...@rocketmail.com> wrote:
> Hi,
>
> Has anyone tried to implement Korean language support in Solr 3.6.1? I defined
> the field type as below in my schema file, but the fieldType is not working.
>
> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000">
>       <analyzer type="index">
>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>         <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true"  bigrammable="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>         <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false"  bigrammable="false"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>       </analyzer>
>     </fieldType>
>
> Error : Caused by: org.apache.solr.common.SolrException: Unknown fieldtype 'text_kr' specified on field product_name_kr
>
> Regards,
> Poornima
>