Posted to issues@commons.apache.org by "Bruno P. Kinoshita (JIRA)" <ji...@apache.org> on 2019/03/09 10:14:00 UTC

[jira] [Created] (TEXT-158) Incorrect values for Jaccard similarity with empty strings

Bruno P. Kinoshita created TEXT-158:
---------------------------------------

             Summary: Incorrect values for Jaccard similarity with empty strings
                 Key: TEXT-158
                 URL: https://issues.apache.org/jira/browse/TEXT-158
             Project: Commons Text
          Issue Type: Bug
    Affects Versions: 1.6
            Reporter: Bruno P. Kinoshita
             Fix For: 1.7


In a discussion that was part of TEXT-126, it was [pointed out|https://github.com/apache/commons-text/pull/103#discussion_r263988298] that, for two empty strings, the Jaccard similarity returns 0.0 and the Jaccard distance returns 1.0, while other libraries return the opposite value for each.
{code:java}
package br.eti.kinoshita.tests.text;

import java.util.Collections;

public class EditDistances {

    public static void main(String[] args) {
        System.out.println("Testing jaccard sim/dis with empty strings");
        System.out.println("---");
        // Simmetrics: compares sets of tokens, so pass two empty sets.
        org.simmetrics.metrics.Jaccard<String> j1 = new org.simmetrics.metrics.Jaccard<>();
        float s1 = j1.compare(Collections.emptySet(), Collections.emptySet());
        System.out.println("Simmetrics Jaccard similarity: " + s1);
        float d1 = j1.distance(Collections.emptySet(), Collections.emptySet());
        System.out.println("Simmetrics Jaccard distance: " + d1);
        
        System.out.println("---");
        
        // java-string-similarity: takes the two strings directly.
        info.debatty.java.stringsimilarity.Jaccard j2 = new info.debatty.java.stringsimilarity.Jaccard();
        double s2 = j2.similarity("", "");
        System.out.println("javastringsimilarity Jaccard similarity: " + s2);
        double d2 = j2.distance("", "");
        System.out.println("javastringsimilarity Jaccard distance: " + d2);
        
        System.out.println("---");
        
        // Commons Text: similarity and distance are separate classes.
        org.apache.commons.text.similarity.JaccardSimilarity j3_1 = new org.apache.commons.text.similarity.JaccardSimilarity();
        double s3 = j3_1.apply("", "");
        System.out.println("commons-text Jaccard similarity: " + s3);
        org.apache.commons.text.similarity.JaccardDistance j3_2 = new org.apache.commons.text.similarity.JaccardDistance();
        double d3 = j3_2.apply("", "");
        System.out.println("commons-text Jaccard distance: " + d3);
    }
}{code}
Produces:
{noformat}
Testing jaccard sim/dis with empty strings
---
Simmetrics Jaccard similarity: 1.0
Simmetrics Jaccard distance: 0.0
---
javastringsimilarity Jaccard similarity: 1.0
javastringsimilarity Jaccard distance: 0.0
---
commons-text Jaccard similarity: 0.0
commons-text Jaccard distance: 1.0{noformat}
We need to confirm the correct output for similarity and distance with empty strings, and then either document why we return the current values or fix this as a bug for the next release.
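
For reference, the 1.0/0.0 results from the other two libraries are consistent with the convention that two empty sets are identical. Below is a minimal sketch of that convention, assuming the comparison is made on sets of characters. The class {{JaccardSketch}} and its methods are hypothetical names for illustration only; this is not the Commons Text implementation nor a proposed patch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Commons Text.
final class JaccardSketch {

    // Collects the distinct characters of the input into a set.
    static Set<Character> charSet(CharSequence cs) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    // Jaccard similarity: |A intersect B| / |A union B|.
    static double similarity(CharSequence left, CharSequence right) {
        Set<Character> a = charSet(left);
        Set<Character> b = charSet(right);
        if (a.isEmpty() && b.isEmpty()) {
            // Assumption: two empty sets are treated as identical,
            // which also avoids dividing 0 by 0.
            return 1.0;
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Jaccard distance is the complement of the similarity.
    static double distance(CharSequence left, CharSequence right) {
        return 1.0 - similarity(left, right);
    }
}
```

Under this convention only the both-empty case changes; comparing an empty string with a non-empty one still gives similarity 0.0 and distance 1.0, because the intersection is empty while the union is not.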


