Posted to issues@commons.apache.org by "Alex D Herbert (JIRA)" <ji...@apache.org> on 2019/03/09 13:29:00 UTC

[jira] [Commented] (TEXT-158) Incorrect values for Jaccard similarity with empty strings

    [ https://issues.apache.org/jira/browse/TEXT-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16788676#comment-16788676 ] 

Alex D Herbert commented on TEXT-158:
-------------------------------------

There are arguments for each result. It comes down to which of these is taken as correct for the 0 / 0 case:
{noformat}
intersect / union = Jaccard

0 / 0             = n / n     = 1
0 / 0             = n / 0     = NaN
0 / 0             = 0 / n     = 0
{noformat}
Java would return {{NaN}} for the floating-point division {{0.0 / 0.0}}.
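
As a minimal sketch of that last point (the class and method names here are hypothetical and not part of Commons Text), a naive ratio computed in floating point with no special case for an empty union gives {{NaN}} when both counts are zero:

```java
public class NaiveJaccard {
    /** Naive ratio with no special handling of an empty union (illustrative only). */
    public static double ratio(int intersectionSize, int unionSize) {
        // Cast to double before dividing: the integer expression 0 / 0
        // would throw ArithmeticException instead of producing NaN.
        return (double) intersectionSize / unionSize;
    }

    public static void main(String[] args) {
        System.out.println(ratio(0, 0)); // NaN
        System.out.println(ratio(2, 2)); // 1.0
        System.out.println(ratio(0, 3)); // 0.0
    }
}
```

So a value of 0 or 1 for two empty inputs has to be chosen explicitly; it does not fall out of the arithmetic.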

However, an empty string has a different meaning from {{null}}. Comparing {{null}} for similarity is invalid and should either return {{NaN}} or, as the library currently does, throw an {{IllegalArgumentException}}.

Comparing an empty string to something is valid, so it should return a sensible value that does not trip up the user.

I would vote for an empty string to be perfectly similar to another empty string.

This is in line with other similarity scores in the library, e.g. {{JaroWinklerSimilarity}}.

So this is a bug.
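
The behaviour proposed above could look roughly like the following. This is only a sketch under stated assumptions, not the library's implementation: {{JaccardSketch}} and its methods are hypothetical names, and it assumes the score is computed over the sets of characters in each input.

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {
    /** Jaccard similarity over character sets; two empty inputs are treated as identical. */
    public static double similarity(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            // null is an invalid input, unlike an empty string
            throw new IllegalArgumentException("Input cannot be null");
        }
        Set<Character> leftSet = toSet(left);
        Set<Character> rightSet = toSet(right);
        if (leftSet.isEmpty() && rightSet.isEmpty()) {
            // Proposed convention for the 0 / 0 case: perfectly similar
            return 1.0;
        }
        Set<Character> union = new HashSet<>(leftSet);
        union.addAll(rightSet);
        Set<Character> intersection = new HashSet<>(leftSet);
        intersection.retainAll(rightSet);
        return (double) intersection.size() / union.size();
    }

    /** Distance defined as the complement of the similarity. */
    public static double distance(CharSequence left, CharSequence right) {
        return 1.0 - similarity(left, right);
    }

    // Collects the distinct characters of the input into a set.
    private static Set<Character> toSet(CharSequence cs) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }
}
```

With this convention, {{similarity("", "")}} is 1.0 and {{distance("", "")}} is 0.0, while an empty string compared with a non-empty one still scores 0.0 because the union is non-empty and the intersection is empty.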

 

> Incorrect values for Jaccard similarity with empty strings
> ----------------------------------------------------------
>
>                 Key: TEXT-158
>                 URL: https://issues.apache.org/jira/browse/TEXT-158
>             Project: Commons Text
>          Issue Type: Bug
>    Affects Versions: 1.6
>            Reporter: Bruno P. Kinoshita
>            Priority: Minor
>             Fix For: 1.7
>
>
> In a discussion that was part of TEXT-126, it was [pointed out|https://github.com/apache/commons-text/pull/103#discussion_r263988298] that, for two empty strings, the Jaccard similarity returns 0.0 and the distance 1.0, whereas other libraries return the opposite for each.
> {code:java}
> package br.eti.kinoshita.tests.text;
> import java.util.Collections;
> public class EditDistances {
>     public static void main(String[] args) {
>         System.out.println("Testing jaccard sim/dis with empty strings");
>         System.out.println("---");
>         org.simmetrics.metrics.Jaccard<String> j1 = new org.simmetrics.metrics.Jaccard<>();
>         float s1 = j1.compare(Collections.emptySet(), Collections.emptySet());
>         System.out.println("Simmetrics Jaccard similarity: " + s1);
>         float d1 = j1.distance(Collections.emptySet(), Collections.emptySet());
>         System.out.println("Simmetrics Jaccard distance: " + d1);
>         
>         System.out.println("---");
>         
>         info.debatty.java.stringsimilarity.Jaccard j2 = new info.debatty.java.stringsimilarity.Jaccard();
>         double s2 = j2.similarity("", "");
>         System.out.println("javastringsimilarity Jaccard similarity: " + s2);
>         double d2 = j2.distance("", "");
>         System.out.println("javastringsimilarity Jaccard distance: " + d2);
>         
>         System.out.println("---");
>         
>         org.apache.commons.text.similarity.JaccardSimilarity j3_1 = new org.apache.commons.text.similarity.JaccardSimilarity();
>         double s3 = j3_1.apply("", "");
>         System.out.println("commons-text Jaccard similarity: " + s3);
>         org.apache.commons.text.similarity.JaccardDistance j3_2 = new org.apache.commons.text.similarity.JaccardDistance();
>         double d3 = j3_2.apply("", "");
>         System.out.println("commons-text Jaccard distance: " + d3);
>     }
> }{code}
> Produces:
> {noformat}
> Testing jaccard sim/dis with empty strings
> ---
> Simmetrics Jaccard similarity: 1.0
> Simmetrics Jaccard distance: 0.0
> ---
> javastringsimilarity Jaccard similarity: 1.0
> javastringsimilarity Jaccard distance: 0.0
> ---
> commons-text Jaccard similarity: 0.0
> commons-text Jaccard distance: 1.0{noformat}
> We need to confirm what the correct output is for similarity and distance with empty strings, and then either document why we return the current values or fix it as a bug for the next release.


