Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/08/13 18:38:00 UTC

[jira] [Commented] (IMPALA-8752) Add Jaro-winkler edit distance and similarity built-in function

    [ https://issues.apache.org/jira/browse/IMPALA-8752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906504#comment-16906504 ] 

ASF subversion and git services commented on IMPALA-8752:
---------------------------------------------------------

Commit 8db7f27ddde226f3efd3bddcc00665d0d9b99ef0 in impala's branch refs/heads/master from luksan47
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8db7f27 ]

IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function

The added functions return the Jaro/Jaro-Winkler similarity/distance
of two strings. The algorithm calculates the Jaro similarity of the
strings, then (for Jaro-Winkler) adds more weight to the result if
the strings share a common prefix.
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold.
The prefix weight is applied only if the Jaro similarity
exceeds the given threshold. By default, its value is 0.7.
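
For illustration only, the behaviour described above can be sketched in
Python. This is not the Impala implementation (which is C++). The default
scaling factor of 0.1, the prefix cap of 4 characters, the treatment of
two identical strings as similarity 1.0, and distance being computed as
1 - similarity are the usual textbook conventions and are assumptions
here; only the 0.7 boost threshold default comes from this commit message.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]; a sketch, not Impala's code."""
    if s1 == s2:
        return 1.0  # assumption: identical (even empty) strings score 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters count as a match only within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are
    # half-transpositions; two of them make one transposition.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler_similarity(s1: str, s2: str,
                            scaling_factor: float = 0.1,   # assumed default
                            boost_threshold: float = 0.7) -> float:
    """Jaro similarity plus a prefix bonus, gated by the boost threshold."""
    sim = jaro_similarity(s1, s2)
    if sim <= boost_threshold:
        return sim  # prefix weight applies only above the threshold
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:  # assumed cap of 4 prefix characters
            break
        prefix += 1
    return sim + prefix * scaling_factor * (1.0 - sim)


def jaro_winkler_distance(s1: str, s2: str) -> float:
    """Assumed here to be the complement of the similarity."""
    return 1.0 - jaro_winkler_similarity(s1, s2)


if __name__ == "__main__":
    print(jaro_similarity("MARTHA", "MARHTA"))
    print(jaro_winkler_similarity("MARTHA", "MARHTA"))
    # Shared prefix "ab", but the Jaro similarity is below 0.7, so no boost.
    print(jaro_winkler_similarity("abcdef", "abzzzzzz"))
```

With the threshold in place, a pair such as "abcdef" / "abzzzzzz" keeps its
plain Jaro score despite the shared prefix, whereas passing
boost_threshold=0.0 to the sketch would let the prefix bonus apply.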

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc
 * Manual testing over 1400 word pairs from
   http://marvin.cs.uidaho.edu/misspell.html
   Results match Apache Commons.

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Reviewed-on: http://gerrit.cloudera.org:8080/13870
Reviewed-by: Zoltan Borok-Nagy <bo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Add Jaro-winkler edit distance and similarity built-in function
> ---------------------------------------------------------------
>
>                 Key: IMPALA-8752
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8752
>             Project: IMPALA
>          Issue Type: New Feature
>            Reporter: Norbert Luksa
>            Assignee: Norbert Luksa
>            Priority: Major
>              Labels: built-in-function
>
> References:
>  * [Apache commons - JaroWinklerDistance |[https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/JaroWinklerDistance.html]]
>  * [Apache commons - JaroWinklerSimilarity |[https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/JaroWinklerSimilarity.html]]
>  * [Oracle - JARO_WINKLER[_SIMILARITY]|[https://oracle-base.com/articles/11g/utl_match-string-matching-in-oracle]]
> Notable differences:
>  * With similarity, the Oracle version returns a normalized result ranging from 0 to 100.
>  * In the Apache version, null values result in exceptions.
>  * Apache rounds the values to two digits.
> The scaling factor of the algorithm can be added as an extra/default argument.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)