
Posted to user@commons.apache.org by Michael Tobias <mi...@tobias.org.uk> on 2014/06/11 09:08:33 UTC

[CODEC] Beider Morse Phonetic Matching Bug and questions

Does anybody have a working knowledge of the coding of the Beider Morse
Phonetic Matching in the Apache Commons Codec?

 

My recent tests using Solr suggest there is a discrepancy between Steve
Morse and Alexander Beider's algorithm and the algorithm currently live in
Solr (and hence the Commons Codec).

 

I know that the source code for BMPM issued by Steve has changed several
times over the years, and I thought at first that the version used in the
Commons Codec might be an old one that has since been superseded. Should the
version of the BMPM algorithm not be listed in the Commons Codec
documentation? How should version changes to the algorithm be handled? The
algorithm is quite static now, so this is probably less important than it
once was, but surely the version should be DOCUMENTED?

 

My tests now indicate that the discrepancies are NOT a version problem:
testing against a very old version 2.00 of the BMPM source code, issued on 18
June 2009, still shows the same discrepancies.

 

Using just a single test term, the results are not good. The only saving
grace is that the most widely used configuration is

 

nameType="GENERIC" ruleType="APPROX"

 

and that is a close (but not perfect) match at least for this ONE test word.

 

For the name Abram, all with languageSet="auto"

 

GENERIC APPROX - fails - misses a few tokens

Should create tokens: abram abrom avram avrom obram obrom ovram ovrom abran
abron obran obron Ybram Ybrom

Solr creates: abram abrom avram avrom obram obrom ovram ovrom abran abron
obran obron

 

GENERIC EXACT - good!

Should create tokens: abram abran

Solr creates: abram abran

 

ASHKENAZI APPROX: - fails dreadfully!

Should create tokens: abram abrom avram avrom obram obrom ovram ovrom Ybram
Ybrom ombram ombrom imbram imbrom

Solr creates: abrAm AvrAm BbrAm

 

ASHKENAZI EXACT: - good!

Should create tokens: abram

Solr creates: abram

 

SEPHARDIC APPROX: - good!

Should create tokens: abram bram abran bran avram vram

Solr creates: abram bram abran bran avram vram

 

SEPHARDIC EXACT: - good!

Should create tokens: abram abran avram

Solr creates: abram abran avram 
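
The comparison above can be checked mechanically. The following is only a minimal sketch, assuming the token lists are pasted in as space-separated strings exactly as listed in this message; the class name TokenDiff and its helper methods are hypothetical and not part of Commons Codec or Solr.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: compares the tokens the reference implementation
// should create with the tokens Solr actually creates.
class TokenDiff {

    /** Splits a space-separated token list into an ordered set. */
    static Set<String> tokens(String spaceSeparated) {
        return new LinkedHashSet<>(Arrays.asList(spaceSeparated.trim().split("\\s+")));
    }

    /** Tokens present in a but absent from b, in a's order. */
    static Set<String> minus(Set<String> a, Set<String> b) {
        Set<String> out = new LinkedHashSet<>(a);
        out.removeAll(b);
        return out;
    }

    public static void main(String[] args) {
        // GENERIC APPROX lists copied from the message above.
        Set<String> expected = tokens("abram abrom avram avrom obram obrom ovram ovrom "
                + "abran abron obran obron Ybram Ybrom");
        Set<String> solr = tokens("abram abrom avram avrom obram obrom ovram ovrom "
                + "abran abron obran obron");

        System.out.println("missing from Solr: " + minus(expected, solr));
        System.out.println("unexpected in Solr: " + minus(solr, expected));
    }
}
```

For the GENERIC APPROX lists this reports Ybram and Ybrom as missing and nothing as unexpected. Running the same comparison on the ASHKENAZI APPROX lists shows that none of the expected tokens are produced.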

 

I would appreciate it if somebody with knowledge of the programming of this
functionality could investigate.

 

For the worst case I attach here a debug trace of the calculation of the
Ashkenazi Approx tokens straight from Steve Morse's implementation. It looks
like some of the final rules are not being applied properly, or at all.
The language codes in square brackets vary from BMPM version to version, but
the resulting tokens have not changed from version 2.00 up to the current 3.02.

 

Thanks

 

Michael

 

 

 

applying language rules from (rulesany) to abram using languages 2012

char codes = [#61]a [#62]b [#72]r [#61]a [#6d]m

applying rule #225
   pattern=a
   lcontext=
   rcontext=[bcdgkpstwzż]
   subst=(A|B[128])
   result=(A[2012]|B[128])

applying rule #229
   pattern=b
   lcontext=
   rcontext=
   subst=b
   result=(Ab[2012]|Bb[128])

applying rule #245
   pattern=r
   lcontext=
   rcontext=
   subst=r
   result=(Abr[2012]|Bbr[128])

applying rule #228
   pattern=a
   lcontext=
   rcontext=
   subst=A
   result=(AbrA[2012]|BbrA[128])

applying rule #240
   pattern=m
   lcontext=
   rcontext=
   subst=m
   result=(AbrAm[2012]|BbrAm[128])

after language rules: (AbrAm[2012]|BbrAm[128])


applying final rules from (exactapproxcommon plus approxcommon) to AbrAm[2012]
no rules match for phonetic item 0 at position 0: A
no rules match for phonetic item 0 at position 1: Ab
no rules match for phonetic item 0 at position 2: Abr
no rules match for phonetic item 0 at position 3: AbrA
no rules match for phonetic item 0 at position 4: AbrAm

applying final rules from (exactapproxcommon plus approxcommon) to BbrAm[128]
no rules match for phonetic item 1 at position 0: B
no rules match for phonetic item 1 at position 1: Bb
no rules match for phonetic item 1 at position 2: Bbr
no rules match for phonetic item 1 at position 3: BbrA
no rules match for phonetic item 1 at position 4: BbrAm

applying final rules from (approxany) to AbrAm[2012]
after applying final rule #97 to phonetic item #0 at position 0:
(a[2012]|o[2012]|Y[16]) pattern=A lcontext= rcontext= subst=(a|o|Y[16])
after applying final rule #0 to phonetic item #0 at position 1:
(ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16]) pattern=b lcontext= rcontext= subst=(b|v[1024])
no rules match for phonetic item 0 at position 2:
(ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16])r
after applying final rule #93 to phonetic item #0 at position 3:
(abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|Ybra[16]|Ybro[16]) pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
no rules match for phonetic item 0 at position 4:
(abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|Ybra[16]|Ybro[16])m

applying final rules from (approxany) to BbrAm[128]
after applying final rule #22 to phonetic item #1 at position 0:
(o[2012]|om[128]|im[128]) pattern=B lcontext= rcontext=[bp] subst=(o|om[128]|im[128])
after applying final rule #0 to phonetic item #1 at position 1:
(ob[2012]|ov[1024]|omb[128]|imb[128]) pattern=b lcontext= rcontext= subst=(b|v[1024])
no rules match for phonetic item 1 at position 2:
(ob[2012]|ov[1024]|omb[128]|imb[128])r
after applying final rule #93 to phonetic item #1 at position 3:
(obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128]|imbro[128]) pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
no rules match for phonetic item 1 at position 4:
(obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128]|imbro[128])m

 

 

 

resulting tokens:


(abram|abrom|avram|avrom|obram|obrom|ovram|ovrom|Ybram|Ybrom|ombram|ombrom|imbram|imbrom)
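
To make the (approxany) step of the trace easier to follow, here is a simplified model of it. It is only a sketch, not the Commons Codec code and not Steve Morse's PHP: the class FinalRuleSketch, its rule table and its helper names are hypothetical, it contains only the four rules that fire in the trace, and it assumes that both phonetic items start from language mask 2012 (which is what the bracketed codes in the trace suggest). An alternative survives only if the bitwise AND of its language mask with the mask accumulated so far is non-zero, which is why combinations such as Yv[16 & 1024] never appear.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical, simplified model of the "approxany" final-rule pass in the trace.
class FinalRuleSketch {
    static final int ANY = -1; // alternative carries no language restriction

    /** One final rule: pattern, right-context regex, alternatives and their language masks. */
    static final class Rule {
        final String pattern;
        final Pattern rcontext; // null means "no right context"
        final String[] alts;
        final int[] langs;

        Rule(String pattern, String rcontext, String[] alts, int[] langs) {
            this.pattern = pattern;
            this.rcontext = rcontext.isEmpty() ? null : Pattern.compile(rcontext);
            this.alts = alts;
            this.langs = langs;
        }

        boolean matches(String input, int pos) {
            if (!input.startsWith(pattern, pos)) {
                return false;
            }
            String rest = input.substring(pos + pattern.length());
            // The context must match directly after the pattern, hence lookingAt().
            return rcontext == null || rcontext.matcher(rest).lookingAt();
        }
    }

    // Only the rules seen in the trace (#0, #22, #93, #97); the context-sensitive
    // "A" rule is listed before the context-free one so that it wins when it applies.
    static final List<Rule> RULES = Arrays.asList(
            new Rule("b", "", new String[] {"b", "v"}, new int[] {ANY, 1024}),
            new Rule("B", "[bp]", new String[] {"o", "om", "im"}, new int[] {ANY, 128, 128}),
            new Rule("A", "[fklmnprst]$", new String[] {"a", "o"}, new int[] {ANY, ANY}),
            new Rule("A", "", new String[] {"a", "o", "Y"}, new int[] {ANY, ANY, 16}));

    /** Expands one phonetic item into its alternatives, dropping language-incompatible ones. */
    static List<String> expand(String item, int languages) {
        Map<String, Integer> partial = new LinkedHashMap<>();
        partial.put("", languages);
        int pos = 0;
        while (pos < item.length()) {
            Rule hit = null;
            for (Rule r : RULES) {
                if (r.matches(item, pos)) {
                    hit = r;
                    break;
                }
            }
            // No rule matches: copy the character through unchanged.
            String[] alts = hit == null ? new String[] {item.substring(pos, pos + 1)} : hit.alts;
            int[] langs = hit == null ? new int[] {ANY} : hit.langs;
            Map<String, Integer> next = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                for (int i = 0; i < alts.length; i++) {
                    int merged = e.getValue() & langs[i];
                    if (merged != 0) {
                        next.putIfAbsent(e.getKey() + alts[i], merged);
                    }
                }
            }
            partial = next;
            pos += hit == null ? 1 : hit.pattern.length();
        }
        return new ArrayList<>(partial.keySet());
    }

    /** Union of both phonetic items from the trace, in order of first appearance. */
    static Set<String> tokens() {
        Set<String> out = new LinkedHashSet<>();
        out.addAll(expand("AbrAm", 2012));
        out.addAll(expand("BbrAm", 2012));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", tokens()));
    }
}
```

Under these assumptions the sketch prints the same 14 tokens as the trace, in the same order. If the port skipped this pass entirely, the intermediate forms with capital A and B would be left in the output, which looks like what Solr returns for Ashkenazi Approx.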

RE: [CODEC] Beider Morse Phonetic Matching Bug and questions

Posted by Michael Tobias <mi...@tobias.org.uk>.

I am a newbie here and am unable to offer any views on the significance of the license situation, but it was the intention of Steve Morse and Alexander Beider to offer their algorithm to the wider developer world and so I am not sure why they would not be willing to re-licence the code to the Apache Software Foundation.

I am in contact with both guys having worked with them for many years so if you point me to any relevant Apache Software Foundation License documentation and explain exactly what is needed I will be happy to send it to them, discuss with them, and hopefully get this sorted as quickly as possible.

Michael 

-----Original Message-----
From: Thomas Neidhart [mailto:thomas.neidhart@gmail.com] 
Sent: 11 June 2014 12:56
To: Commons Developers List
Subject: Re: [CODEC] Beider Morse Phonetic Matching Bug and questions

Hi,

As already commented on https://issues.apache.org/jira/browse/CODEC-187, the problem is related to some wrongly ported rule files from the original source.

This, on the other hand, creates a serious problem for us, as it looks like the Beider-Morse phonetic matching encoder in commons-codec is a derived work of a PHP codebase released under the GPLv3 licence.
The original codebase is available at http://stevemorse.org/phoneticinfo.htm.
While investigating the bug and comparing our rule files with the ones from the original codebase, it became quite clear that at least these are identical.

The author of the patch (see https://issues.apache.org/jira/browse/CODEC-125)
ported the code and applied the Apache license, but the license of the original codebase was never considered or discussed.

This is quite serious I guess, as we have already released the code. We can ask the original authors to re-license their code to the Apache Software Foundation under a compatible license, but I wonder if they are willing to do so.
This encoder is also used a lot in lucene/solr so it might have even larger implications.

Any ideas how to proceed or if a re-licensing would be sufficient in this case?

Thomas


Re: [CODEC] Beider Morse Phonetic Matching Bug and questions

Posted by Gary Gregory <ga...@gmail.com>.

IANAL, but there are plenty of components out there released under multiple
licenses.

Would the easiest solution be for the BM authors to add the Apache 2.0
license to their artifacts? They can keep the current license if they are
happy with it and add the ASL 2.

Otherwise, we'll have to delete BM from [codec] and cut a new release.

Gary






Re: [CODEC] Beider Morse Phonetic Matching Bug and questions

Posted by Thomas Neidhart <th...@gmail.com>.

He stated several times that the contributed code is a port from PHP, but
it looks like everybody assumed he was the copyright owner of the PHP
code (which he obviously isn't).

It is important to add links (e.g. papers or reference implementations) for
such code / algorithms, especially in the case of ported code.
Not only to have a clear trace of the license situation but also to be able
to backport changes/fixes from the original source.

Regarding the re-licensing of the original source:

I hope the authors agree, but usually there is a reason why code is
published under GPL rather than BSD/MIT/Apache style licenses.
An option would be to explicitly grant the ASF the permission to use their
code (including the rule files) so that they do not have to re-publish
their own code.

Regarding a clean-room implementation: I do not think this is feasible, as
the core of the algorithm is the set of associated rule files, which do not
appear in the paper itself.

As the maintainer of codec, do you want to contact the authors about this
issue?

Thomas



On Wed, Jun 11, 2014 at 4:54 PM, Gary Gregory <ga...@gmail.com>
wrote:

> This also raises a process issue. The [CODEC-125] ticket states "I have
> implemented Beider Morse Phonetic Matching as a codec against the
> commons-codec svn trunk. I would like to contribute this to commons-codec."
>
> Also the MP states: "Patch with license granted". So he granted the license
> for his Java version. But that version was derived from PHP under GPLv3,
> which is what we think the issue is.
>
> Arg.
>
> Gary

Re: [CODEC] Beider Morse Phonetic Matching Bug and questions

Posted by Gary Gregory <ga...@gmail.com>.

This also raises a process issue. The [CODEC-125] ticket states "I have
implemented Beider Morse Phonetic Matching as a codec against the
commons-codec svn trunk. I would like to contribute this to commons-codec."

Also the MP states: "Patch with license granted". So he granted the license
for his Java version. But that version was derived from PHP under GPLv3,
which is what we think the issue is.

Arg.

Gary


> > no rules match for phonetic item 0 at position 1: Ab
> > no rules match for phonetic item 0 at position 2: Abr
> > no rules match for phonetic item 0 at position 3: AbrA
> > no rules match for phonetic item 0 at position 4: AbrAm
> >
> > applying final rules from (exactapproxcommon plus approxcommon) to
> > BbrAm[128]
> > no rules match for phonetic item 1 at position 0: B
> > no rules match for phonetic item 1 at position 1: Bb
> > no rules match for phonetic item 1 at position 2: Bbr
> > no rules match for phonetic item 1 at position 3: BbrA
> > no rules match for phonetic item 1 at position 4: BbrAm
> >
> > applying final rules from (approxany) to AbrAm[2012]
> > after applying final rule #97 to phonetic item #0 at position 0:
> > (a[2012]|o[2012]|Y[16]) pattern=A lcontext= rcontext= subst=(a|o|Y[16])
> > after applying final rule #0 to phonetic item #0 at position 1:
> > (ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16]) pattern=b lcontext=
> rcontext=
> > subst=(b|v[1024])
> > no rules match for phonetic item 0 at position 2:
> > (ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16])r
> > after applying final rule #93 to phonetic item #0 at position 3:
> >
> >
> (abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024
> > ]|ovro[1024]|Ybra[16]|Ybro[16]) pattern=A lcontext= rcontext=[fklmnprst]$
> > subst=(a|o)
> > no rules match for phonetic item 0 at position 4:
> >
> >
> (abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024
> > ]|ovro[1024]|Ybra[16]|Ybro[16])m
> >
> > applying final rules from (approxany) to BbrAm[128]
> > after applying final rule #22 to phonetic item #1 at position 0:
> > (o[2012]|om[128]|im[128]) pattern=B lcontext= rcontext=[bp]
> > subst=(o|om[128]|im[128])
> > after applying final rule #0 to phonetic item #1 at position 1:
> > (ob[2012]|ov[1024]|omb[128]|imb[128]) pattern=b lcontext= rcontext=
> > subst=(b|v[1024])
> > no rules match for phonetic item 1 at position 2:
> > (ob[2012]|ov[1024]|omb[128]|imb[128])r
> > after applying final rule #93 to phonetic item #1 at position 3:
> >
> >
> (obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128
> > ]|imbro[128]) pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
> > no rules match for phonetic item 1 at position 4:
> >
> >
> (obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128
> > ]|imbro[128])m
> >
> >
> >
> >
> >
> >
> >
> > resulting tokens:
> >
> >
> >
> >
> (abram|abrom|avram|avrom|obram|obrom|ovram|ovrom|Ybram|Ybrom|ombram|ombrom|i
> > mbram|imbrom)
> >
> >
>



-- 
E-Mail: garydgregory@gmail.com | ggregory@apache.org

Re: [CODEC] Beider Morse Phonetic Matching Bug and questions

Posted by Mark Thomas <ma...@apache.org>.

On 11/06/2014 12:56, Thomas Neidhart wrote:
> Hi,
> 
> As already noted on https://issues.apache.org/jira/browse/CODEC-187, the
> problem is caused by some rule files that were ported incorrectly from the
> original source.
> 
> On the other hand, this creates a serious problem for us, as it looks like the
> Beider-Morse phonetic matching encoder in commons-codec is a derived work
> from a PHP codebase released under the GPLv3 licence.
> The original codebase is available at http://stevemorse.org/phoneticinfo.htm.
> While investigating the bug and comparing our rule files with the ones from
> the original codebase, it became quite clear that at least these are identical.
> 
> The author of the patch (see https://issues.apache.org/jira/browse/CODEC-125)
> ported the code and applied the Apache license, but the license of the
> original codebase was never considered or discussed.
> 
> This is quite serious I guess, as we have already released the code. We can
> ask the original authors to re-license their code to the Apache Software
> Foundation under a compatible license, but I wonder if they are willing to
> do so.
> This encoder is also used a lot in Lucene/Solr, so it might have even larger
> implications.
> 
> Any ideas on how to proceed, or whether re-licensing would be sufficient in
> this case?

Re-licensing or permission from the original authors would be
sufficient. If that is not forthcoming then there is no option but to
delete the code.

Replacing any removed code with a 'clean-room' implementation would be
acceptable but in that case the removal of the current code must not
wait for any replacement.

Mark

> 
> Thomas



Re: [CODEC] Beider Morse Phonetic Matching Bug and questions

Posted by Thomas Neidhart <th...@gmail.com>.

Hi,

As already noted on https://issues.apache.org/jira/browse/CODEC-187, the
problem is caused by some rule files that were ported incorrectly from the
original source.

On the other hand, this creates a serious problem for us, as it looks like the
Beider-Morse phonetic matching encoder in commons-codec is a derived work
from a PHP codebase released under the GPLv3 licence.
The original codebase is available at http://stevemorse.org/phoneticinfo.htm.
While investigating the bug and comparing our rule files with the ones from
the original codebase, it became quite clear that at least these are identical.

The author of the patch (see https://issues.apache.org/jira/browse/CODEC-125)
ported the code and applied the Apache license, but the license of the
original codebase was never considered or discussed.

This is quite serious I guess, as we have already released the code. We can
ask the original authors to re-license their code to the Apache Software
Foundation under a compatible license, but I wonder if they are willing to
do so.
This encoder is also used a lot in Lucene/Solr, so it might have even larger
implications.

Any ideas on how to proceed, or whether re-licensing would be sufficient in
this case?

Thomas
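
For readers following the rule-file discussion, here is a minimal Java sketch
of what the final "approxany" pass in the debug trace from the original
message appears to do for the ASHKENAZI APPROX case. It is a toy model, not
the Commons Codec or the original PHP implementation: the class and method
names (FinalRuleSketch, Alt, alternatives, expand) are hypothetical, only the
four rules visible in the trace (#0, #22, #93, #97) are hard-coded, and it
assumes that every phonetic item starts the pass with the full language set
2012 and that an alternative is dropped when the bitwise AND of its language
masks is zero.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the "approxany" final-rule pass shown in the debug trace. */
final class FinalRuleSketch {

    /** Language set printed in the trace for "abram" (assumed starting mask). */
    static final int ANY = 2012;

    /** One substitution alternative: replacement text plus a language bitmask. */
    static final class Alt {
        final String text;
        final int langs;

        Alt(String text, int langs) {
            this.text = text;
            this.langs = langs;
        }
    }

    /** Alternatives for the symbol at index i, hard-coded from the trace. */
    static List<Alt> alternatives(String phonetic, int i) {
        char c = phonetic.charAt(i);
        String rest = phonetic.substring(i + 1);
        if (c == 'A' && rest.matches("[fklmnprst]")) {   // rule #93, rcontext=[fklmnprst]$
            return List.of(new Alt("a", ANY), new Alt("o", ANY));
        }
        if (c == 'A') {                                   // rule #97
            return List.of(new Alt("a", ANY), new Alt("o", ANY), new Alt("Y", 16));
        }
        if (c == 'B' && rest.matches("[bp].*")) {         // rule #22, rcontext=[bp]
            return List.of(new Alt("o", ANY), new Alt("om", 128), new Alt("im", 128));
        }
        if (c == 'b') {                                   // rule #0
            return List.of(new Alt("b", ANY), new Alt("v", 1024));
        }
        return List.of(new Alt(String.valueOf(c), ANY));  // no rule matches: copy the symbol
    }

    /** Expands one intermediate form (for example "AbrAm") into its final tokens. */
    static List<String> expand(String phonetic) {
        // Maps each partial token to the languages still compatible with it.
        Map<String, Integer> partial = new LinkedHashMap<>();
        partial.put("", ANY);
        for (int i = 0; i < phonetic.length(); i++) {
            Map<String, Integer> next = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> p : partial.entrySet()) {
                for (Alt alt : alternatives(phonetic, i)) {
                    int langs = p.getValue() & alt.langs;
                    if (langs != 0) { // drop combinations with no common language
                        next.merge(p.getKey() + alt.text, langs, (x, y) -> x | y);
                    }
                }
            }
            partial = next;
        }
        return new ArrayList<>(partial.keySet());
    }

    public static void main(String[] args) {
        // The two intermediate forms listed under "after language rules" in the trace.
        Set<String> tokens = new LinkedHashSet<>();
        tokens.addAll(expand("AbrAm"));
        tokens.addAll(expand("BbrAm"));
        // prints: abram abrom avram avrom obram obrom ovram ovrom Ybram Ybrom ombram ombrom imbram imbrom
        System.out.println(String.join(" ", tokens));
    }
}
```

Under these assumptions the sketch reproduces the 14 tokens that the original
report says ASHKENAZI APPROX should create. The tokens Solr was reported to
create (abrAm AvrAm BbrAm) still contain the intermediate symbols A and B,
which is what one would expect to see if this pass were not applied.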


