Posted to issues@commons.apache.org by "Bruno P. Kinoshita (JIRA)" <ji...@apache.org> on 2019/03/10 04:39:00 UTC

[jira] [Updated] (TEXT-155) Add a generic OverlapSimilarity measure

     [ https://issues.apache.org/jira/browse/TEXT-155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno P. Kinoshita updated TEXT-155:
------------------------------------
    Summary: Add a generic OverlapSimilarity measure  (was: Add a generic SetSimilarity measure)

> Add a generic OverlapSimilarity measure
> ---------------------------------------
>
>                 Key: TEXT-155
>                 URL: https://issues.apache.org/jira/browse/TEXT-155
>             Project: Commons Text
>          Issue Type: New Feature
>    Affects Versions: 1.6
>            Reporter: Alex D Herbert
>            Priority: Minor
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> The {{SimilarityScore<T>}} interface can be used to compute a generic result. I propose adding a class that computes the intersection between two sets formed from the characters of the inputs. Each set is built from a {{CharSequence}} argument of the {{apply}} method using a {{Function<CharSequence, Set<T>>}}. This function is passed to the {{SimilarityScore<T>}} implementation during construction.
> The result then reports the size of each set and the size of their intersection.
> I have created an implementation that can compute the equivalent of the {{JaccardSimilarity}} class by creating {{Set<Character>}}, and also the F1-score using bigrams (pairs of characters) by creating {{Set<String>}}. This relates to [TEXT-126|https://issues.apache.org/jira/projects/TEXT/issues/TEXT-126], which suggested an algorithm for the Sorensen-Dice similarity, also known as the F1-score.
> Here is an example:
> {code:java}
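> import java.util.HashSet;
> import java.util.Set;
> import java.util.function.Function;
>
> // Match the functionality of the JaccardSimilarity class.
> // IntersectionSimilarity and IntersectionResult are the classes proposed in this issue.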
> Function<CharSequence, Set<Character>> converter = (cs) -> {
>     final Set<Character> set = new HashSet<>();
>     for (int i = 0; i < cs.length(); i++) {
>         set.add(cs.charAt(i));
>     }
>     return set;
> };
> IntersectionSimilarity<Character> similarity = new IntersectionSimilarity<>(converter);
> IntersectionResult result = similarity.apply("something", "something else");
> {code}
> The result has the size of set A, set B and the intersection between them.
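To illustrate how the three sizes can be turned into a score, here is a minimal sketch that does not depend on the proposed classes. The class name OverlapScores and its methods are hypothetical, and returning 0 when both sets are empty is an assumption made for this example only.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Commons Text.
final class OverlapScores {

    /** Collects the distinct characters of the input. */
    static Set<Character> toCharacterSet(CharSequence cs) {
        final Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    /** Returns {|A|, |B|, |A intersect B|} for the character sets of the inputs. */
    static int[] sizes(CharSequence left, CharSequence right) {
        final Set<Character> a = toCharacterSet(left);
        final Set<Character> b = toCharacterSet(right);
        final Set<Character> common = new HashSet<>(a);
        common.retainAll(b); // keep only the elements also present in b
        return new int[] {a.size(), b.size(), common.size()};
    }

    /** Jaccard index: |A intersect B| / |A union B|. Assumption: 0 when both sets are empty. */
    static double jaccard(int sizeA, int sizeB, int intersection) {
        final int union = sizeA + sizeB - intersection;
        return union == 0 ? 0.0 : (double) intersection / union;
    }

    /** Sorensen-Dice (F1-score): 2 |A intersect B| / (|A| + |B|). Assumption: 0 when both sets are empty. */
    static double sorensenDice(int sizeA, int sizeB, int intersection) {
        final int total = sizeA + sizeB;
        return total == 0 ? 0.0 : 2.0 * intersection / total;
    }

    public static void main(String[] args) {
        final int[] s = sizes("something", "something else");
        System.out.println("Jaccard: " + jaccard(s[0], s[1], s[2]));
        System.out.println("Sorensen-Dice: " + sorensenDice(s[0], s[1], s[2]));
    }
}
```

For "something" and "something else" the character sets have sizes 9 and 11 with an intersection of 9, so the Jaccard index is 9/11 and the Sorensen-Dice score is 18/20.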
> This class was inspired by my look through the various similarity implementations. All of them except the {{CosineSimilarity}} perform single character matching between the input {{CharSequence}}s. The {{CosineSimilarity}} tokenises using whitespace to create words.
> This more generic type of implementation will allow a user to determine how to divide the {{CharSequence}} to create the sets that are compared, e.g. single characters, words, bigrams, etc.
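As a sketch of the other ways to divide the input mentioned above, the converters below produce bigram and word sets using the same Function<CharSequence, Set<String>> shape as the description. The class name Converters and the constants BIGRAMS and WORDS are hypothetical. Splitting words on whitespace is an assumption chosen for this example.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical converters, not part of Commons Text.
final class Converters {

    /** Overlapping pairs of adjacent characters, e.g. "night" gives ni, ig, gh, ht. */
    static final Function<CharSequence, Set<String>> BIGRAMS = cs -> {
        final Set<String> set = new HashSet<>();
        for (int i = 0; i + 1 < cs.length(); i++) {
            set.add(cs.subSequence(i, i + 2).toString());
        }
        return set;
    };

    /** Distinct words, assuming words are separated by runs of whitespace. */
    static final Function<CharSequence, Set<String>> WORDS = cs -> {
        final Set<String> set = new HashSet<>();
        for (String word : cs.toString().trim().split("\\s+")) {
            if (!word.isEmpty()) { // splitting an empty string yields one empty token
                set.add(word);
            }
        }
        return set;
    };

    public static void main(String[] args) {
        System.out.println(BIGRAMS.apply("night"));
        System.out.println(WORDS.apply("something else"));
    }
}
```

Either function could be supplied wherever the description passes a converter. Only the element type of the set changes, from Character to String.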



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)