Posted to solr-dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/08/05 12:37:14 UTC

[jira] Created: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Add support for lucene's SmartChineseAnalyzer
---------------------------------------------

                 Key: SOLR-1336
                 URL: https://issues.apache.org/jira/browse/SOLR-1336
             Project: Solr
          Issue Type: New Feature
          Components: Analysis
            Reporter: Robert Muir


SmartChineseAnalyzer was contributed to Lucene; it indexes Simplified Chinese text as words.

If factories for the tokenizer and word token filter are added to Solr, it can be used, although there should be a sample config or wiki entry showing how to apply the built-in stopwords list.
This is because the list doesn't contain actual stopwords, but must be used to prevent indexing punctuation.

Note: we did some refactoring/cleanup on this analyzer recently, so it would be much easier to do this after the next Lucene update.
It has also been moved out of -analyzers.jar due to size and now builds in its own smartcn jar file, so that jar would need to be added if this feature is desired.
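As a sketch of the kind of sample config mentioned above, a field type in schema.xml might look roughly like this. The factory class names are assumptions about what such a patch would add, not confirmed names; the stopwords path is the jar-internal list discussed elsewhere in this thread:

```xml
<!-- Hypothetical sketch only: factory class names are assumptions. -->
<fieldType name="text_zh" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <!-- the built-in "stopwords" list is mostly punctuation; without it,
         punctuation tokens would be indexed -->
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/stopwords.txt"/>
  </analyzer>
</fieldType>
```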


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750439#action_12750439 ] 

Robert Muir commented on SOLR-1336:
-----------------------------------

bq. Can this be customized to accommodate those languages?
Maybe, but we have to do some work first. The dictionary is limited to the GB2312 encoding, so we can't add support for new languages until this is fixed.

bq. Is there any wiki link or document to help us understand how this tool works? Sort of behind the scenes....
There are some sparse javadocs and code comments. Also see the original JIRA ticket: LUCENE-1629.

bq. What exactly does the dictionary contain? Is it any ordinary chinese dictionary or some sort of a customized/lemmatized dictionary? 
There are two dictionaries: a word dictionary and a bigram dictionary.
These contain words and bigrams respectively, along with frequencies, in a "trie"-like structure organized by Chinese character.
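As a loose illustration of that structure (this is not the actual smartcn file format or API, just a sketch of the idea), a frequency trie keyed by character could look like:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a trie organized by character, where each complete
// entry carries a frequency -- loosely analogous to the word/bigram
// dictionaries described above. Not the real smartcn data structure.
public class FreqTrie {
    private final Map<Character, FreqTrie> children = new HashMap<>();
    private int freq; // > 0 marks the end of a dictionary entry

    public void add(String word, int frequency) {
        FreqTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new FreqTrie());
        }
        node.freq = frequency;
    }

    // Returns the stored frequency, or 0 if the word is not in the trie.
    public int frequency(String word) {
        FreqTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return 0;
        }
        return node.freq;
    }

    public static void main(String[] args) {
        FreqTrie dict = new FreqTrie();
        dict.add("中国", 100);
        dict.add("中", 40);
        System.out.println(dict.frequency("中国")); // 100
        System.out.println(dict.frequency("国"));   // 0 (no such entry)
    }
}
```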

bq. Also, how can one add new words to the dictionary?
This is currently quite difficult; please see LUCENE-1817 for some background information.
For the moment you will have to recompile your own custom jar file and be familiar with the file formats the analyzer uses.
Note, we added strong warnings because we would like to change the file formats in an upcoming release to something based on Unicode.
That way we can support more languages, and perhaps also make it easier to customize the dictionary data.




[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754539#action_12754539 ] 

Robert Muir commented on SOLR-1336:
-----------------------------------

bq. contrib?

Sounds reasonable to me. In a few days I can upload a new patch.



[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756592#action_12756592 ] 

Robert Muir commented on SOLR-1336:
-----------------------------------

Thanks. So do we want a contrib (which would mostly just be the jar file + the 2 factories), or should it go in example/solr/lib?

If we do the latter, where should I put the factories? They could be useful if someone wants the Chinese analysis to work a little differently;
for example, SmartChineseAnalyzer does Porter stemming on English, but someone might not want that.
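For instance, with the factories exposed, a user could assemble a chain that omits the English stemming step. A hypothetical sketch (the factory class names are assumptions):

```xml
<!-- Hypothetical sketch: sentence tokenization + word segmentation only,
     with no Porter stemming applied to embedded English tokens. -->
<fieldType name="text_zh_nostem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
  </analyzer>
</fieldType>
```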



[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749877#action_12749877 ] 

Robert Muir commented on SOLR-1336:
-----------------------------------

Hi, thanks for testing!

First, I am having trouble figuring out what is going on here, since the stack trace looks unrelated to the Smart Chinese analyzer.
It's a little more difficult because I am looking at the latest Solr code, and line 64 of my TokenizerChain is not tokenStream()!

Given the exception you are getting, I suspect something is out of date... maybe it's as simple as an 'ant clean' and a recompile?



[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753247#action_12753247 ] 

Yonik Seeley commented on SOLR-1336:
------------------------------------

I was going to check this out, but Lucene 2.9_RC3 doesn't work with Solr - we need to wait for RC4.

Any objections to committing this for 1.4 and adding it to the example server, provided we can verify that there isn't a memory cost if it's not used? The downside is a 3MB jar in solr/lib and in the solr.war.



[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759616#action_12759616 ] 

Robert Muir commented on SOLR-1336:
-----------------------------------

{quote}
Perhaps we could make them lazy load? token streams are reused now, so a small reflection overhead is no longer an issue.
{quote}

If we do this, could we avoid a contrib that is really just a jar file, and instead have the jar file just go in example/solr/lib?




[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754541#action_12754541 ] 

Yonik Seeley commented on SOLR-1336:
------------------------------------

I agree it would be an awkward thing to have inside the solr.war.
Should we copy it to example/solr/lib like the Tika libs (we already have 32MB of jars there)?




[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754481#action_12754481 ] 

Hoss Man commented on SOLR-1336:
--------------------------------

bq. The downside is a 3MB jar in solr/lib and in the solr.war

contrib?

Chinese isn't something everybody needs, and 3MB would almost double the size of the solr.war. 



[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740988#action_12740988 ] 

Robert Muir commented on SOLR-1336:
-----------------------------------

{quote}
Are the stopwords (words="org/apache/lucene/analysis/cn/stopwords.txt") being loaded directly from the jar? If so, a comment to that effect might prevent some confusion. 
{quote}

Yes, good idea.

{quote}
Do you happen to know what the memory footprint of this analyzer is if it's used? I assume the dictionaries will get loaded on the first use.
{quote}

No, I am not sure of the footprint, but it is probably quite large (a few MB). They will be loaded on first use, correct. The smartcn jar file itself is also large due to the dictionaries in question; you may have noticed that the solr.war is much smaller after the last Lucene update, since this analyzer was factored out of -analyzers.jar.

{quote}
Might be cool to add a chinese field to example/exampledocs/solr.xml... or maybe there should be an international.xml doc where we could add a few different languages?
{quote}

I figured this wasn't the best place to have an example... I like the idea of international.xml, with some examples for other languages too.

If there is some concern about the size of this (monster) analyzer, one option is to put these factories/examples elsewhere, to keep the size of Solr smaller.




[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Kumar Raja (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750342#action_12750342 ] 

Kumar Raja commented on SOLR-1336:
----------------------------------

Since this feature works so well, I think it can easily be shipped along with Solr 1.4.
When is this going to be committed to the Solr build?



[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Kumar Raja (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750340#action_12750340 ] 

Kumar Raja commented on SOLR-1336:
----------------------------------

Hi Robert,
Sorry... my bad. There was a mix-up of the Solr versions on my machine, which caused this error.

This tool is great - it works wonderfully, and the test case pass rate is amazing! Is there a similar tool for other Asian languages, say Japanese and Korean? Can this be customized to accommodate those languages?

Is there any wiki link or document to help us understand how this tool works? Sort of behind the scenes... What exactly does the dictionary contain? Is it an ordinary Chinese dictionary or some sort of customized/lemmatized dictionary? Also, how can one add new words to the dictionary?

Thanks,
Kumar



[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759611#action_12759611 ] 

Yonik Seeley commented on SOLR-1336:
------------------------------------

I guess it should go into contrib for now...
bq. where should i put factories?

It would be nice if we could avoid another jar just for 2 small classes.
Perhaps we could make them lazy-load? Token streams are reused now, so a small reflection overhead is no longer an issue.
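The lazy-load idea could be sketched like this. This is illustrative only, not Solr code: the class defers reflective loading until first use, so an optional jar is only required if the feature is actually exercised. The class name in the usage example is a JDK stand-in for a factory living in an optional jar:

```java
import java.lang.reflect.Constructor;

// Sketch of lazy, reflection-based loading: the named class is only
// resolved and instantiated on first use, and the one-time reflection
// cost is amortized because the resulting object is reused afterwards.
public class LazyHolder {
    private final String className;
    private volatile Object instance;

    public LazyHolder(String className) {
        this.className = className;
    }

    public Object get() throws Exception {
        Object local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    // Class.forName fails only here, at first use --
                    // not at startup -- if the jar is absent.
                    Class<?> clazz = Class.forName(className);
                    Constructor<?> ctor = clazz.getDeclaredConstructor();
                    instance = local = ctor.newInstance();
                }
            }
        }
        return local;
    }

    public static void main(String[] args) throws Exception {
        // JDK class used purely as a stand-in for an optional factory.
        LazyHolder lazy = new LazyHolder("java.util.ArrayList");
        System.out.println(lazy.get().getClass().getName());
    }
}
```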




[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750446#action_12750446 ] 

Robert Muir commented on SOLR-1336:
-----------------------------------

Kumar, by the way, I wanted to mention: if by any chance you feel inclined to help us improve this analyzer, please don't hesitate!

There is so much work to do: dictionary format, code refactoring, better Unicode support, among other things.
Even if you don't want to write any code but have good Chinese and English skills, there are still some javadocs in Chinese that haven't been translated.




[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740981#action_12740981 ] 

Yonik Seeley commented on SOLR-1336:
------------------------------------

Thanks Robert!
Are the stopwords (words="org/apache/lucene/analysis/cn/stopwords.txt") being loaded directly from the jar?  If so, a comment to that effect might prevent some confusion.

Do you happen to know what the memory footprint of this analyzer is if it's used?  I assume the dictionaries will get loaded on the first use.

Might be cool to add a Chinese field to example/exampledocs/solr.xml... or maybe there should be an international.xml doc where we could add a few different languages?



[jira] Updated: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-1336:
------------------------------

    Attachment: SOLR-1336.patch

Patch; needs lucene-smartcn-2.9-dev.jar added to lib to work (this analyzer is no longer in the -analyzers.jar).




[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756177#action_12756177 ] 

Stanislaw Osinski commented on SOLR-1336:
-----------------------------------------

Keeping the Chinese analyzer JAR optional sounds good. As Carrot2 also uses it, I'd need to make sure the clustering contrib doesn't fail when the JAR is not there and clustering in Chinese is requested (I think I'd simply log a WARN saying that the Chinese analyzer JAR is required for best clustering results).



[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Kumar Raja (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749859#action_12749859 ] 

Kumar Raja commented on SOLR-1336:
----------------------------------

I applied the patch with the latest Solr code and the lucene-rc2 jars and tried indexing some Chinese text. However, I got an AbstractMethodError during tokenization.
What am I doing wrong here?


h4.{{*THE STACK TRACE*}}
{noformat} 
SEVERE: java.lang.AbstractMethodError
        at org.apache.solr.analysis.TokenizerChain.tokenStream(TokenizerChain.java:64)
        at org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.tokenStream(IndexSchema.java:360)
        at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:44)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:123)
        at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
        at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:762)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:745)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2199)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2171)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:218)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
{noformat} 




[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759630#action_12759630 ] 

Robert Muir commented on SOLR-1336:
-----------------------------------

Yonik, maybe it would be better to wait until these things settle out first? (I glanced at the issues and saw -1, +1, and such.)

I guess there is always the option for release 1.4 to do nothing and instruct users who want this analyzer to put lucene-smartcn-2.9.jar in their lib and use analyzer= (though they will be stuck with Porter stemming and the like for now).
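The analyzer= fallback described above might look like this in schema.xml (a sketch; only the analyzer class name and jar name come from this thread):

```xml
<!-- Sketch of the "do nothing for 1.4" option: drop lucene-smartcn-2.9.jar
     into lib and reference the whole analyzer class directly. No per-filter
     customization (e.g. swapping the stop list) is possible this way. -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>
```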




[jira] Updated: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-1336:
------------------------------

    Attachment: SOLR-1336.patch

Add a warning about large dictionaries, note that stopwords are loaded from the jar file, and add an international.xml with examples for several languages.
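As an illustration of the kind of entry such an international.xml might carry for other languages (hypothetical; the actual file contents are only in the attached patch and are not reproduced in this thread):

```xml
<!-- Hypothetical example entry: a Snowball-stemmed French field type of the
     sort a multi-language sample config might include alongside the
     Chinese one. -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>
```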



[jira] Updated: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-1336:
------------------------------

    Attachment: SOLR-1336.patch

We moved some parts of this analyzer around in LUCENE-1882.

This syncs the patch up with Lucene trunk (not rc2, since rc2 does not reflect LUCENE-1882).




[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759625#action_12759625 ] 

Yonik Seeley commented on SOLR-1336:
------------------------------------

In theory perhaps, but one problem is that example/solr/lib isn't even in svn... nothing lives there; things are copied there (currently).
There have been a lot of discussions on solr-dev lately about where the Tika libs should live, etc.:
http://search.lucidimagination.com/search/document/a9520632864db021/distinct_example_for_solr_cell
And SOLR-1449 is also in the mix as a way to reference jars outside of the example lib.
