Posted to issues@commons.apache.org by "Thomas Neidhart (JIRA)" <ji...@apache.org> on 2014/07/04 15:51:34 UTC

[jira] [Updated] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens

     [ https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Neidhart updated CODEC-187:
----------------------------------

    Attachment: CODEC_187_sync_with_v3.3.diff

The latest attached patch (sync_with_v3.3) applies the following changes:

 * complete sync with the rules from the original Beider/Morse v3.3
 * use a separate language detection rule set for each name type (ash, gen, sep); previously only one generic rule set was used, but the original code distinguishes between the name types
 * fix a bug when applying the final rules: if two phonemes with the same text but different language sets were encountered, only the first one was stored, because the comparator did not take the language into account
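
To illustrate the comparator problem from the last item, here is a minimal, self-contained sketch. It is not taken from the patch or from the Commons Codec sources: the Phoneme class, the two comparators, and the sample values are hypothetical stand-ins, assuming only that phonemes are kept in a sorted set whose comparator decides which entries count as duplicates.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

public class PhonemeDedupSketch {

    // Hypothetical stand-in for a phoneme: its text plus the languages it applies to.
    static final class Phoneme {
        final String text;
        final Set<String> languages;

        Phoneme(String text, String... languages) {
            this.text = text;
            // A TreeSet keeps the languages sorted, so toString() is deterministic.
            this.languages = new TreeSet<>(Arrays.asList(languages));
        }

        @Override
        public String toString() {
            return text + languages;
        }
    }

    // Compares on text only: phonemes that differ just in language count as equal.
    static final Comparator<Phoneme> TEXT_ONLY =
            Comparator.comparing((Phoneme p) -> p.text);

    // Compares on text first, then on the sorted language set.
    static final Comparator<Phoneme> TEXT_AND_LANGUAGES =
            TEXT_ONLY.thenComparing(p -> p.languages.toString());

    // Stores two phonemes with the same text but different languages
    // and reports how many of them the sorted set actually kept.
    static int countStored(Comparator<Phoneme> comparator) {
        Set<Phoneme> stored = new TreeSet<>(comparator);
        stored.add(new Phoneme("ora", "english"));
        stored.add(new Phoneme("ora", "hebrew"));
        return stored.size();
    }

    public static void main(String[] args) {
        System.out.println("text only: " + countStored(TEXT_ONLY));
        System.out.println("text and languages: " + countStored(TEXT_AND_LANGUAGES));
    }
}
```

With the text-only comparator the sorted set treats the second phoneme as a duplicate and silently drops it; once the language set takes part in the comparison, both entries are kept.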

With this patch, the results for the previously reported failures are identical to those of the original implementation.

One exception is the case where Hebrew is explicitly selected as the language: the results then differ, but I do not understand the original code in this regard. We may need to contact the authors about this issue.

> Beider Morse Phonetic Matching producing incorrect tokens
> ---------------------------------------------------------
>
>                 Key: CODEC-187
>                 URL: https://issues.apache.org/jira/browse/CODEC-187
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.9
>            Reporter: michael tobias
>            Priority: Minor
>             Fix For: 1.10
>
>         Attachments: CODEC-187.patch, CODEC-187_ashkenazi_approx_any.patch, CODEC-187_ashkenazi_approx_any_v2.patch, CODEC_187_sync_with_v3.3.diff
>
>
> I believe the Beider Morse Phonetic Matching algorithm was added in Commons Codec 1.6.
> The BMPM algorithm is an EVOLVING algorithm that is currently at version 3.02, though it had been static since version 3.01 dated 19 Dec 2011 (it was first available as open source as version 1.00 on 6 May 2009).
> I can see nothing in the Commons Codec docs to say which version of BMPM was implemented, so I am not sure whether the problem with the algorithm as coded in the Codec is simply that it is an old version or whether there are more basic problems with the implementation.
> How do I determine the version of the algorithm that was implemented in the Commons Codec?
> How do we ensure that the algorithm is updated if/when the BMPM algorithm changes?
> How do we ensure that the algorithm as coded in the Commons Codec is accurate and working as expected?



--
This message was sent by Atlassian JIRA
(v6.2#6252)