Posted to dev@lucene.apache.org by Mark Bennett <mb...@ideaeng.com> on 2009/10/12 19:46:21 UTC

Fix for Japanese SEN morphological analyzer, and moving into Contrib

Hi folks,

I've been working to fix the Japanese SEN morphological analyzer, which is
currently hosted at:
https://sen.dev.java.net

To review, Japanese doesn't use whitespace for word breaks.  The traditional
approach to CJK (Chinese, Japanese, Korean) is to use bigram character pairs
in the index.  While this works to a point, some believe that using proper
word breaks provides better results.
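
For anyone unfamiliar with the bigram approach, here is a minimal sketch of
the idea in plain Java.  It is only an illustration, not the code Lucene's
CJK support actually uses; the class and method names (BigramSketch,
bigrams) are made up for this example, and it assumes the input contains
only characters that fit in a single Java char (no surrogate pairs).

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {
    // Emit every overlapping pair of adjacent characters as one index term.
    // Assumption: each character is a single UTF-16 code unit.
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // "Tokyo Metropolitan Government office", written without spaces.
        System.out.println(bigrams("東京都庁"));
    }
}
```

The terms produced for that input are 東京, 京都 and 都庁.  The middle one,
京都 (Kyoto), is not a word in the original text at all, so a search for
Kyoto would match a document about Tokyo.  That kind of false match is what
a word-based segmenter is meant to avoid.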

The "lucene-ja" glue layer between Lucene and the core SEN library broke in
May of '09 when a fix was made in Lucene:
http://issues.apache.org/jira/browse/LUCENE-1636

Uwe S. had a very good insight for a quick fix, and I have been cleaning up
some other issues with the code.  I have also spoken with the author, Takashi
Okamoto, and he is fine with having this moved from java.net to ASF; I think
it will be easier for folks to find and use if it's in ASF.

I'm not quite ready to submit a patch, but the Wiki suggests emailing the
list with the idea in advance.  I'll have some packaging questions, since
there are actually quite a few parts.  Also, the wiki didn't quite spell
out the process for getting things into contrib, beyond emailing and
submitting a patch.  I also plan to eventually submit a Solr-specific
wrapper to the Solr dev list, to allow dynamic config changes to be made
from Solr's schema.  But since the original code was Lucene based, and it
provides the broadest reach, I think having it in core Lucene would be a
good start.

Any comments, suggestions, or mentor volunteers?  :-)

Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

Re: Fix for Japanese SEN morphological analyzer, and moving into Contrib

Posted by Robert Muir <rc...@gmail.com>.

Mark, I agree with what you said; it would be great if there were a way to
easily enable this Japanese support.

I will let someone else comment on the licensing, but since you mentioned
source dictionaries, I thought Sen only used IPA dic for its data? I could be
wrong on this.

I think it's a BSD-like license; here you can read the license as Google
Chrome prints it (in a separate, really interesting dictionary for CJ
segmentation):

http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/icu38/source/data/brkitr/cjdict.txt

On Mon, Oct 12, 2009 at 2:58 PM, Mark Bennett <mb...@ideaeng.com> wrote:

> Hello Robert,
>
> That's a good question.  The core SEN is under LGPL, yes.  However, I
> didn't need to make changes to that, though given that there are 2 versions
> floating around, I think it needs a good home.
>
> But the glue-layer is under "Apache 2.0" license, and that's the part that
> needed fixing.  I think that means it's ASF / contrib compatible?
>
> There are also 2 other ancillary libraries and some source dictionaries
> which I need to research.
>
> Working from the other direction, which might give you some ideas:
> The goal is to make this more accessible.  It'd be really nice if, in a
> Lucene distribution, the SEN library could be switched on as easily as CJK.
> Or at the most you'd run an ant script to fetch all the parts and assemble
> it.  As it stands now I think it's not used much because it's a bit complex
> to set up, even prior to May '09's change, and most of its users
> discuss it in Japanese.  So that's the goal; I'm very open to ideas on the
> "how".
>
> Mark
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


-- 
Robert Muir
rcmuir@gmail.com

Re: Fix for Japanese SEN morphological analyzer, and moving into Contrib

Posted by Mark Bennett <mb...@ideaeng.com>.

Hello Robert,

That's a good question.  The core SEN is under LGPL, yes.  However, I didn't
need to make changes to that, though given that there are 2 versions
floating around, I think it needs a good home.

But the glue-layer is under "Apache 2.0" license, and that's the part that
needed fixing.  I think that means it's ASF / contrib compatible?

There are also 2 other ancillary libraries and some source dictionaries
which I need to research.

Working from the other direction, which might give you some ideas:
The goal is to make this more accessible.  It'd be really nice if, in a
Lucene distribution, the SEN library could be switched on as easily as CJK.
Or at the most you'd run an ant script to fetch all the parts and assemble
it.  As it stands now I think it's not used much because it's a bit complex
to set up, even prior to May '09's change, and most of its users
discuss it in Japanese.  So that's the goal; I'm very open to ideas on the
"how".

Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

On Mon, Oct 12, 2009 at 11:10 AM, Robert Muir <rc...@gmail.com> wrote:

> Mark, does this mean Sen will be under the Apache license? (it is currently
> LGPL)
>
> --
> Robert Muir
> rcmuir@gmail.com
>

Re: Fix for Japanese SEN morphological analyzer, and moving into Contrib

Posted by Robert Muir <rc...@gmail.com>.

Mark, does this mean Sen will be under the Apache license? (it is currently
LGPL)

-- 
Robert Muir
rcmuir@gmail.com