Posted to dev@lucene.apache.org by Gao Pinker <xi...@gmail.com> on 2009/04/16 15:58:51 UTC

I wanna contribute a Chinese analyzer to lucene

Hi All!

I wrote an Analyzer for Apache Lucene that analyzes sentences in the
Chinese language. It is called imdict-chinese-analyzer, as it is a
subproject of imdict <http://www.imdict.net/>, which is an intelligent
online dictionary.

The project on google code is here:
http://code.google.com/p/imdict-chinese-analyzer/

In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am),
"中国人" (Chinese), *not* as "我", "是中", "国人". So the analyzer must segment each
sentence properly, or there will be wrong terms everywhere in the index
constructed by Lucene, and the accuracy of the search engine will be
seriously affected!

There are already two analyzer packages in the Apache repository that can
handle Chinese:
ChineseAnalyzer<http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/> and
CJKAnalyzer<http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/>.
However, they treat each character, or every two adjoining characters, as a
single word. This obviously does not match real Chinese words, and the
strategy also increases the index size and hurts performance badly.
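
To make the difference concrete, here is a minimal sketch in plain Java of
what per-character and two-character tokenization produce for the sentence
above. It does not use the Lucene API; the class name NGramSketch and the
method ngrams are made up for illustration, and the two-character case is
assumed to emit overlapping pairs.

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {
    /** Splits text into overlapping n-grams of code points (n = 1: single characters, n = 2: pairs). */
    static List<String> ngrams(String text, int n) {
        int[] codePoints = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= codePoints.length; i++) {
            out.add(new String(codePoints, i, n));
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "我是中国人";
        System.out.println(ngrams(sentence, 1)); // [我, 是, 中, 国, 人]
        System.out.println(ngrams(sentence, 2)); // [我是, 是中, 中国, 国人]
    }
}
```

Neither output contains the word "中国人", and the two-character output
contains "是中", which is not a word at all.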

The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model
(HMM), so it can tokenize Chinese sentences in a really intelligent way.
The tokenization accuracy of this model is above 90% according to the paper
"HHMM-based Chinese Lexical Analyzer ICTCLAS"
<http://www.nlp.org.cn/project/project.php?proj_id=6>.
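
As a rough illustration of dictionary-driven segmentation, the sketch below
picks the most probable split of a sentence with a Viterbi-style dynamic
program over word probabilities. This is NOT the HHMM from the paper and not
the analyzer's actual code; it is a much simpler unigram model, and the class
name SegmentSketch, the toy dictionary, and all probabilities in it are
invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SegmentSketch {
    /** Hypothetical toy dictionary: word -> probability (invented numbers). */
    static final Map<String, Double> DICT = new HashMap<>();
    static {
        DICT.put("我", 0.05);
        DICT.put("是", 0.05);
        DICT.put("中", 0.01);
        DICT.put("国", 0.01);
        DICT.put("人", 0.02);
        DICT.put("中国", 0.02);
        DICT.put("国人", 0.001);
        DICT.put("中国人", 0.01);
    }
    /** Probability assumed for a single character missing from the dictionary. */
    static final double UNKNOWN = 1e-8;
    static final int MAX_WORD_LEN = 4;

    /** Returns the split with the highest product of word probabilities (BMP characters only). */
    static List<String> segment(String text) {
        int n = text.length();
        double[] best = new double[n + 1]; // best[i]: max log-probability of any split of text[0, i)
        int[] prev = new int[n + 1];       // prev[i]: start index of the last word in that split
        for (int end = 1; end <= n; end++) {
            best[end] = Double.NEGATIVE_INFINITY;
            for (int start = Math.max(0, end - MAX_WORD_LEN); start < end; start++) {
                Double p = DICT.get(text.substring(start, end));
                if (p == null) {
                    if (end - start > 1) continue; // unknown multi-character strings are not words
                    p = UNKNOWN;                   // unknown single characters stay possible
                }
                double score = best[start] + Math.log(p);
                if (score > best[end]) {
                    best[end] = score;
                    prev[end] = start;
                }
            }
        }
        // Walk the back-pointers from the end of the sentence to recover the words.
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.add(0, text.substring(prev[end], end));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("我是中国人")); // [我, 是, 中国人]
    }
}
```

With this toy dictionary, "中国人" as one word scores higher than "中国" + "人"
or "中" + "国人", so the sentence comes out as "我", "是", "中国人".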

As imdict-chinese-analyzer is a really fast, intelligent Chinese analyzer
for Lucene written in Java, I want to share this project with everyone
using Lucene.

This Analyzer consists of two packages: the source code and the lexical
dictionary. I want to publish the source code under the Apache license, but
the dictionary, which is under an ambiguous license, was not created by me.
So, can I submit only the source code to the Lucene contrib repository, and
let users download the dictionary from the Google Code site?

Please help me with this contribution.

Re: I wanna contribute a Chinese analyzer to lucene

Posted by Otis Gospodnetic <ot...@yahoo.com>.

This would be a great contribution.
I took a quick look at the ZIP file and noticed it depends on, say, net.imdict.wordsegment.WordSegmenter, but I didn't see that class anywhere.  I assume you will patch and polish things, but I thought I'd point this out.


Thanks!
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

RE: I wanna contribute a Chinese analyzer to lucene

Posted by Steven A Rowe <sa...@syr.edu>.

In addition to Ken's suggestions, check out http://wiki.apache.org/lucene-java/HowToContribute for some help on getting set up. - Steve

Re: I wanna contribute a Chinese analyzer to lucene

Posted by Earwin Burrfoot <ea...@gmail.com>.

On Thu, Apr 16, 2009 at 18:16, Ken Krugler <kk...@transpac.com> wrote:
> I wrote a Analyzer for apache lucene for analyzing sentences in Chinese
> language, it's called imdict-chinese-analyzer as it is a subproject of
> imdict, which is an intelligent online dictionary.
>
> The project on google code is here:
> http://code.google.com/p/imdict-chinese-analyzer/
>
> I took a quick look, but didn't see any code posted there yet.
http://code.google.com/p/imdict-chinese-analyzer/downloads/list  ?

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: I wanna contribute a Chinese analyzer to lucene

Posted by John Wang <jo...@gmail.com>.

I understand. It would be good to check it into the svn that Google Code provides.
-John

Re: I wanna contribute a Chinese analyzer to lucene

Posted by Gao Pinker <xi...@gmail.com>.

Hi, I have already put the source code in the zip file
http://imdict-chinese-analyzer.googlecode.com/files/imdict-chinese-analyzer.zip

Re: I wanna contribute a Chinese analyzer to lucene

Posted by John Wang <jo...@gmail.com>.

Hi Gao:
    On the google code page, can you check in the source?

Thanks

-John

Re: I wanna contribute a Chinese analyzer to lucene

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Tue, May 5, 2009 at 5:30 AM, Gao Pinker <xi...@gmail.com> wrote:

> 1. My code depends on log4j, but I found no other analyzers depend on it, so
> should I keep the dependence or change my code to remove it?

If removing the dependency is not difficult, I would remove it?  It
might inhibit others who otherwise would want to use the analyzer?

> 2. This analyzer needs some lexical dictionary(under apache license v2,
> about 7MB), should it be put into lucene svn repository or just place a
> reference to let the user download from it's official site?

Either way is OK, though I would say include it in Lucene since the
license allows it?  (We've had issues with downloads sites sometimes
being down...).

Mike

Re: I wanna contribute a Chinese analyzer to lucene

Posted by Gao Pinker <xi...@gmail.com>.

I have opened a new issue (http://issues.apache.org/jira/browse/LUCENE-1629)
and am now creating the patch.
There are 2500 lines of code to be added, because this Chinese analyzer is
really complex.

Now I have 2 problems:

1. My code depends on log4j, but I found that no other analyzers depend on
it, so should I keep the dependency or change my code to remove it?

2. This analyzer needs a lexical dictionary (under Apache License v2, about
7MB). Should it be put into the Lucene svn repository, or should I just
place a reference so that users can download it from its official site
<http://code.google.com/p/imdict-chinese-analyzer/>?


Re: I wanna contribute a Chinese analyzer to lucene

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Mon, May 4, 2009 at 12:21 AM, Gao Pinker <xi...@gmail.com> wrote:

> I've got a lexical dictionary from the author under apache license v2, and
> the code is all written by myself,
> so, the legal problems are solved entirely.

Excellent!

> Now could you please tell me how to open an issue in Lucene's Jira
> (http://issues.apache.org/jira/browse/LUCENE),
> so I can post my source as a patch?

Just create yourself an account in Jira
(https://issues.apache.org/jira/secure/Signup!default.jspa) and then
you can open a new issue & attach the patch.

Thanks!

Mike

Re: I wanna contribute a Chinese analyzer to lucene

Posted by Gao Pinker <xi...@gmail.com>.

Hi all!

I have obtained a lexical dictionary from its author under Apache License
v2, and the code was all written by me, so the legal problems are entirely
solved.

Now could you please tell me how to open an issue in Lucene's Jira
(http://issues.apache.org/jira/browse/LUCENE),
so I can post my source as a patch?
Thank you!

Re: I wanna contribute a Chinese analyzer to lucene

Posted by Ken Krugler <kk...@transpac.com>.

>I wrote a Analyzer for apache lucene for analyzing sentences in
>Chinese language, it's called imdict-chinese-analyzer as it is a
>subproject of imdict <http://www.imdict.net/>, which is an
>intelligent online dictionary.
>
>The project on google code is here:
>http://code.google.com/p/imdict-chinese-analyzer/

I took a quick look, but didn't see any code posted there yet.

[snip]

>This Analyzer contains two packages, the source code and the lexical 
>dictionary. I want to publish the source code using Apache license, 
>but the dictionary which is under an ambigus license was not create 
>by me.
>So, can I only submit the source code to lucene contribution 
>repository, and let the users download the dictionary from the 
>google code site?

I believe your code can be a contrib, with a reference to the 
dictionary. So a first step would be to open an issue in Lucene's 
Jira (http://issues.apache.org/jira/browse/LUCENE), and post your 
source as a patch.

The best way to get the right answer to the legal issue is to post it 
to the legal-discuss@apache.org list (join it first), as Apache's 
lawyers can then respond to your specific question.

-- Ken
-- 
Ken Krugler
+1 530-210-6378

Re: I wanna contribute a Chinese analyzer to lucene

Posted by Otis Gospodnetic <ot...@yahoo.com>.

 --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch