You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Ben-Waters (via GitHub)" <gi...@apache.org> on 2023/06/26 22:06:47 UTC

[GitHub] [commons-codec] Ben-Waters opened a new pull request, #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Ben-Waters opened a new pull request, #189:
URL: https://github.com/apache/commons-codec/pull/189

   With the current implementation of NYSIIS, it is possible to incorrectly remove the first character from the encoding. 
   
   According to the algorithm the first character of the string should be the first character of the encoding, then based on a bunch of other rules are applied to the string characters are removed. The implementation in commons-codec passes the entire string into the transcodeRemaining method which works for the most part and then afterwards, checks that there is at least 1 character before removing the final 'A' or 'S'. 
   
   The problem is, if you have a word like "ASH" you will end up with a single final character of "A". Similarly with "SSH" you would have "S" and the logic will currently remove it and return a blank string when it should still return at least the first letter of the original string.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] kinow commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "kinow (via GitHub)" <gi...@apache.org>.
kinow commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1244213822


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   I couldn't find a C++ implementation nor a good source from Wikipedia. So I checked another implementation, `phonics` in R. It has a nysiis algorithm implementation that allows for "modified" key. For "A" it gives "" too, but the modified version gives "AS".
   
   ```R
   >install.packages('phonics')
   >library('phonics')
   >nysiis(c("ASH"))
   [1] ""
   > nysiis(c("ASH"), modified=TRUE)
   [1] "AS"
   > nysiis(c("ASHBURTON"))
   [1] "ASBART"
   ```
   
   Their [vignette](https://cloud.r-project.org/web/packages/phonics/phonics.pdf) (PDF) (a vignette is like a javadocs for an R package) documents the basic algorithm, but links to this PDF that explains the whole package a lot better: [James P. Howard, II, Phonetic Spelling Algorithm Implementations for R](https://www.jstatsoft.org/article/view/v095i08)
   
   ![image](https://github.com/apache/commons-codec/assets/304786/44bf589f-4e98-4cf2-8c24-fc099c665933)
   
   Searching more about those papers after the text above, I found this issue that describes the same thing I just said :grimacing: : [CODEC-235](https://issues.apache.org/jira/browse/CODEC-235)
   
   So my think we should document that the current version in Commons Codec is the one from the first paper, and then maybe add the other implementation as a separate class/method and let users to pick which one they want to use.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] garydgregory commented on pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "garydgregory (via GitHub)" <gi...@apache.org>.
garydgregory commented on PR #189:
URL: https://github.com/apache/commons-codec/pull/189#issuecomment-1675511771

   I just re-read the comments it seems like:
   - We are not sure if the current code implements the "plain" or original algorithm.
   - We don't have access to, or cannot find, the paper for the original algorithm.
   - If we implement the plain original algorithm, then we can add a new class for the newer "modifed" algorithm.
   - If we do not implement the plain original algorithm, then we need to talk about that.
   
   Help needed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] garydgregory commented on pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "garydgregory (via GitHub)" <gi...@apache.org>.
garydgregory commented on PR #189:
URL: https://github.com/apache/commons-codec/pull/189#issuecomment-1609398223

   @Ben-Waters Not directly related, but do you have any thoughts on #281?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] Ben-Waters commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "Ben-Waters (via GitHub)" <gi...@apache.org>.
Ben-Waters commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1244635165


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   Based on the [steps from wikipedia](https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System#cite_ref-taft_2-0), the core of the issue is that in step 4, we are setting the pointer to the first character instead of the 2nd character. The 9th step on wikipedia is a bit ambiguous though as to whether that step includes the first character or not. If it does, then you would get a blank string like commons-codec currently gives. If not, then you would get 'A'.
   
   Unfortunately I don't have access to the original source book for the algorithm to try to clarify. It looks like [berkeley law](https://lawcat.berkeley.edu/record/38122) has it but I don't have an account for that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] garydgregory commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "garydgregory (via GitHub)" <gi...@apache.org>.
garydgregory commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1242890706


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   Hi @Ben-Waters 
   Thank you for your PR.
   Who is right? This or Fuzzy?
   ```
   py
   Python 3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import fuzzy
   >>> fuzzy.nysiis('ASH')
   'AS'
   >>>
   ```
   Thoughts?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] Ben-Waters commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "Ben-Waters (via GitHub)" <gi...@apache.org>.
Ben-Waters commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1242928389


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   @garydgregory 
   Based on this [other implementation](http://www.dropby.com/indexLF.html?content=/NYSIIS.html) it would be just A.
   
   According to the [algorithm](https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System), the final character should be removed if it is an 'S' so I would think it should be removed.
   
   The current commons-codec implementation is removing it as well as the final 'A' but is ignoring the part about "The first character of the NYSIIS code is the first character of the name."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] Ben-Waters commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "Ben-Waters (via GitHub)" <gi...@apache.org>.
Ben-Waters commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1243061478


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   It's pretty frustrating but it seems like the results are inconsistent depending on where you go.
   
   Looking at [this link](https://rosettacode.org/wiki/NYSIIS) the Java version returns AS but the python returns A. For my use-case, either one of those would be fine but it seems hard to get a consistent answer. Commons-codec is currently returning nothing though and I haven't seen another do that yet.
   
   I don't know of a good official implementation to run it against.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] garydgregory commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "garydgregory (via GitHub)" <gi...@apache.org>.
garydgregory commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1242990559


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   I had not used it before today either. I was looking for _something_ easy to compare this PR's behavior. 
   
   Let's see if anybody else has feedback. 
   
   Do you happen to know of anything else that is easily testable for this use-case? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] garydgregory commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "garydgregory (via GitHub)" <gi...@apache.org>.
garydgregory commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1243066100


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   I would think that returning something like "A" is better than "". Let's see if anyone else suggests anything. @kinow ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] garydgregory commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "garydgregory (via GitHub)" <gi...@apache.org>.
garydgregory commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1244281309


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   The question that comes up for me is: Is our current implementation "plain" NYSIIS, or, is any non-standard or any behavior from the "modified" L&A 1977 algorithm present in our current implementation? If our current code is "plain", then I could see us creating a `ModifiedNysiis` class, if not, then we are in a bit of a pickle.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] garydgregory commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "garydgregory (via GitHub)" <gi...@apache.org>.
garydgregory commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1242990559


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   I had not used it before today either. I was looking for _something_ easy to compare this PR's behavior. 
   
   Let's see if anybody else has feedback. 
   
   Do you know of any other software that is easily testable for this use-case? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] Ben-Waters commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "Ben-Waters (via GitHub)" <gi...@apache.org>.
Ben-Waters commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1242928389


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   Based on this [other implementation](http://www.dropby.com/indexLF.html?content=/NYSIIS.html) it would be just A.
   
   According to the [algorithm](https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System), the final character should be removed if it is an 'S' so I would think it should be removed.
   
   The current commons-codec implementation is removing it as well as the final 'A' but is ignoring the part about "The first character of the NYSIIS code is the first character of the name."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] kinow commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "kinow (via GitHub)" <gi...@apache.org>.
kinow commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1244213822


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   I couldn't find a C++ implementation nor a good source from Wikipedia. So I checked another implementation, `phonics` in R. It has a nysiis algorithm implementation that allows for "modified" key. For "A" it gives "" too, but the modified version gives "AS".
   
   ```R
   >install.packages('phonics')
   >library('phonics')
   >nysiis(c("ASH"))
   [1] ""
   > nysiis(c("ASH"), modified=TRUE)
   [1] "AS"
   > nysiis(c("ASHBURTON"))
   [1] "ASBART"
   ```
   
   Their [docs](https://cloud.r-project.org/web/packages/phonics/phonics.pdf) (PDF) documents the basic algorithm, but links to this PDF that explains the whole package a lot better: [James P. Howard, II, Phonetic Spelling Algorithm Implementations for R](https://www.jstatsoft.org/article/view/v095i08)
   
   ![image](https://github.com/apache/commons-codec/assets/304786/44bf589f-4e98-4cf2-8c24-fc099c665933)
   
   Searching more about those papers after the text above, I found this issue that describes the same thing I just said :grimacing: : [CODEC-235](https://issues.apache.org/jira/browse/CODEC-235)
   
   So my think we should document that the current version in Commons Codec is the one from the first paper, and then maybe add the other implementation as a separate class/method and let users to pick which one they want to use.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] Ben-Waters commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "Ben-Waters (via GitHub)" <gi...@apache.org>.
Ben-Waters commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1242934167


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   I haven't used fuzzy before but looking at [their implementation](https://github.com/yougov/fuzzy/blob/master/src/fuzzy.pyx#L81) I don't see where they remove the trailing S or A.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] kinow commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "kinow (via GitHub)" <gi...@apache.org>.
kinow commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1244288609


##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   Although the results seem to indicate we implement the plain NYSIIS, someone has indeed to verify it reading our code and the algorithm. A good point, +1!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-codec] Ben-Waters commented on pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

Posted by "Ben-Waters (via GitHub)" <gi...@apache.org>.
Ben-Waters commented on PR #189:
URL: https://github.com/apache/commons-codec/pull/189#issuecomment-1610607556

   > @Ben-Waters Not directly related, but do you have any thoughts on #36?
   
   
   
   > @Ben-Waters Not directly related, but do you have any thoughts on #36?
   
   Hmmm I'm no expert on this since it isn't NYSIIS but some other algorithm but I can take a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org