Posted to issues@commons.apache.org by "Alex D Herbert (JIRA)" <ji...@apache.org> on 2019/03/08 11:37:00 UTC
[jira] [Updated] (TEXT-157) Remove rounding from JaccardSimilarity and Distance
[ https://issues.apache.org/jira/browse/TEXT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alex D Herbert updated TEXT-157:
--------------------------------
Description:
The {{JaccardSimilarity}} uses rounding to 2 decimal places. This prevents ranking of dissimilar sequences of even moderately short length.
Using sequences that have 1 or 2 characters in common, with all remaining characters different:
{noformat}
2 0.500000 1.000000 : aa vs (ab or aa)
3 0.250000 0.330000 : aaD vs (abd or aaÀ)
4 0.170000 0.200000 : aaDE vs (abde or aaÀÁ)
5 0.130000 0.140000 : aaDEF vs (abdef or aaÀÁÂ)
6 0.100000 0.110000 : aaDEFG vs (abdefg or aaÀÁÂÃ)
7 0.080000 0.090000 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
8 0.070000 0.080000 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
9 0.060000 0.070000 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
10 0.060000 0.060000 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
{noformat}
Without rounding, the scores differ where rounding previously produced the same score. This will improve ranking:
{noformat}
2 0.500000 1.000000 : aa vs (ab or aa)
3 0.250000 0.333333 : aaD vs (abd or aaÀ)
4 0.166667 0.200000 : aaDE vs (abde or aaÀÁ)
5 0.125000 0.142857 : aaDEF vs (abdef or aaÀÁÂ)
6 0.100000 0.111111 : aaDEFG vs (abdefg or aaÀÁÂÃ)
7 0.083333 0.090909 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
8 0.071429 0.076923 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
9 0.062500 0.066667 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
10 0.055556 0.058824 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
{noformat}
Generated using:
{code:java}
import org.apache.commons.text.similarity.JaccardSimilarity;
// @Test comes from JUnit; import it to match the JUnit version in use.

@Test
public void roundingDemo() {
    // First character of each dissimilar sequence.
    // Chosen for a nice output where we already know the loop
    // will exit before sequence overlap.
    char ch1 = 'D';
    char ch2 = 'd';
    char ch3 = 0x00c0;
    // 1 or 2 characters in common
    StringBuilder sb1 = new StringBuilder("aa");
    StringBuilder sb2 = new StringBuilder("ab"); // 1 in common
    StringBuilder sb3 = new StringBuilder("aa"); // 2 in common
    JaccardSimilarity similarity = new JaccardSimilarity();
    // Extend the sequences until a single/double character
    // similarity cannot be detected
    double j1, j2;
    do {
        j1 = similarity.apply(sb1, sb2);
        j2 = similarity.apply(sb1, sb3);
        System.out.printf("%2d %f %f : %s vs (%s or %s)%n",
            sb1.length(), j1, j2, sb1, sb2, sb3);
        // Extend the sequence using unique characters for each
        sb1.append(ch1++);
        sb2.append(ch2++);
        sb3.append(ch3++);
        // Note: Check length since the sequences will overlap
        // in case the rounding is not present
    } while (j1 != j2 && sb1.length() < 26);
}
{code}
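To make the collision at length 10 concrete without depending on the library, here is a minimal sketch. It is not the Commons Text implementation: it assumes a Jaccard index computed over the set of characters in each sequence, and a hypothetical {{roundTo2dp}} helper based on {{Math.round}}. The class and method names are illustrative only, and returning 0 when both inputs are empty is an assumption of the sketch.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardRoundingSketch {

    /** Collects the distinct characters of a sequence. */
    static Set<Character> toSet(CharSequence cs) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    /** Unrounded Jaccard index: |intersection| / |union| of the character sets. */
    static double jaccard(CharSequence left, CharSequence right) {
        Set<Character> a = toSet(left);
        Set<Character> b = toSet(right);
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        if (union == 0) {
            return 0.0; // assumption of this sketch: two empty inputs score 0
        }
        return (double) intersection.size() / union;
    }

    /** Hypothetical helper: round to 2 decimal places. */
    static double roundTo2dp(double value) {
        return Math.round(value * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        // The length-10 sequences from the tables above; the accented
        // characters are written as Unicode escapes (U+00C0 to U+00C7).
        String s1 = "aaDEFGHIJK";
        String s2 = "abdefghijk";
        String s3 = "aa\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5\u00c6\u00c7";

        double j1 = jaccard(s1, s2); // 1 shared of 18 distinct characters
        double j2 = jaccard(s1, s3); // 1 shared of 17 distinct characters

        System.out.printf("unrounded: %f %f%n", j1, j2);
        System.out.printf("rounded:   %f %f%n", roundTo2dp(j1), roundTo2dp(j2));
    }
}
```

For sequences of length n built as in the demo, the two unrounded scores are 1/(2n-2) and 1/(2n-3), so they always differ. Rounding to 2 decimal places first merges them at n = 10, where 1/18 and 1/17 both become 0.06.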
was:
The {{JaccardSimilarity}} uses rounding to 2 decimal places. This prevents detection of dissimilar sequences of even moderately short length.
Using sequences that have 1 or 2 characters in common, with all remaining characters different:
{noformat}
2 0.500000 1.000000 : aa vs (ab or aa)
3 0.250000 0.330000 : aaD vs (abd or aaÀ)
4 0.170000 0.200000 : aaDE vs (abde or aaÀÁ)
5 0.130000 0.140000 : aaDEF vs (abdef or aaÀÁÂ)
6 0.100000 0.110000 : aaDEFG vs (abdefg or aaÀÁÂÃ)
7 0.080000 0.090000 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
8 0.070000 0.080000 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
9 0.060000 0.070000 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
10 0.060000 0.060000 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
{noformat}
Without rounding, the scores differ where rounding previously produced the same score:
{noformat}
2 0.500000 1.000000 : aa vs (ab or aa)
3 0.250000 0.333333 : aaD vs (abd or aaÀ)
4 0.166667 0.200000 : aaDE vs (abde or aaÀÁ)
5 0.125000 0.142857 : aaDEF vs (abdef or aaÀÁÂ)
6 0.100000 0.111111 : aaDEFG vs (abdefg or aaÀÁÂÃ)
7 0.083333 0.090909 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
8 0.071429 0.076923 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
9 0.062500 0.066667 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
10 0.055556 0.058824 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
{noformat}
Generated using:
{code:java}
import org.apache.commons.text.similarity.JaccardSimilarity;
// @Test comes from JUnit; import it to match the JUnit version in use.

@Test
public void roundingDemo() {
    // First character of each dissimilar sequence.
    // Chosen for a nice output where we already know the loop
    // will exit before sequence overlap.
    char ch1 = 'D';
    char ch2 = 'd';
    char ch3 = 0x00c0;
    // 1 or 2 characters in common
    StringBuilder sb1 = new StringBuilder("aa");
    StringBuilder sb2 = new StringBuilder("ab"); // 1 in common
    StringBuilder sb3 = new StringBuilder("aa"); // 2 in common
    JaccardSimilarity similarity = new JaccardSimilarity();
    // Extend the sequences until a single/double character
    // similarity cannot be detected
    double j1, j2;
    do {
        j1 = similarity.apply(sb1, sb2);
        j2 = similarity.apply(sb1, sb3);
        System.out.printf("%2d %f %f : %s vs (%s or %s)%n",
            sb1.length(), j1, j2, sb1, sb2, sb3);
        // Extend the sequence using unique characters for each
        sb1.append(ch1++);
        sb2.append(ch2++);
        sb3.append(ch3++);
        // Note: Check length since the sequences will overlap
        // in case the rounding is not present
    } while (j1 != j2 && sb1.length() < 26);
}
{code}
> Remove rounding from JaccardSimilarity and Distance
> ---------------------------------------------------
>
> Key: TEXT-157
> URL: https://issues.apache.org/jira/browse/TEXT-157
> Project: Commons Text
> Issue Type: Improvement
> Affects Versions: 1.6
> Reporter: Alex D Herbert
> Assignee: Alex D Herbert
> Priority: Trivial