Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/08/13 18:38:00 UTC
[jira] [Commented] (IMPALA-8752) Add Jaro-winkler edit distance and
similarity built-in function
[ https://issues.apache.org/jira/browse/IMPALA-8752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906504#comment-16906504 ]
ASF subversion and git services commented on IMPALA-8752:
---------------------------------------------------------
Commit 8db7f27ddde226f3efd3bddcc00665d0d9b99ef0 in impala's branch refs/heads/master from luksan47
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8db7f27 ]
IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function
The added functions return the Jaro or Jaro-Winkler similarity or
distance of two strings. The algorithm calculates the Jaro similarity
of the strings, then adds more weight to the result if the strings
share a common prefix (Jaro-Winkler).
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
Extended the algorithm with another optional parameter, the boost
threshold. The prefix weight is only applied if the Jaro similarity
exceeds the given threshold. By default, its value is 0.7.
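For illustration only, the behaviour described above can be sketched in
Python. This is not the Impala implementation (which lives in the C++
backend); it follows the textbook definition from the Wikipedia article
linked above. The scaling factor of 0.1 and the prefix cap of 4 characters
are the conventional Winkler values and are assumptions here, as are the
Python function and parameter names. Only the 0.7 default boost threshold
comes from the commit message.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]; a sketch, not Impala's code."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order are
    # "half transpositions"; two of them make one transposition.
    half_transpositions = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions // 2

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0


def jaro_winkler_similarity(s1: str, s2: str,
                            scaling_factor: float = 0.1,   # assumed default
                            boost_threshold: float = 0.7) -> float:
    """Jaro similarity plus a prefix bonus, applied only above the threshold."""
    jaro = jaro_similarity(s1, s2)
    if jaro <= boost_threshold:
        return jaro
    # Length of the common prefix, capped at 4 characters (assumed cap).
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * scaling_factor * (1.0 - jaro)


def jaro_winkler_distance(s1: str, s2: str, **kwargs) -> float:
    """Distance taken as 1 - similarity (the usual convention)."""
    return 1.0 - jaro_winkler_similarity(s1, s2, **kwargs)
```

With these definitions, jaro_similarity("MARTHA", "MARHTA") is about 0.9444
and jaro_winkler_similarity("MARTHA", "MARHTA") is about 0.9611. Raising
boost_threshold above the Jaro score of a pair leaves that pair without the
prefix bonus.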
The new built-in functions are:
* jaro_distance, jaro_dst
* jaro_similarity, jaro_sim
* jaro_winkler_distance, jw_dst
* jaro_winkler_similarity, jw_sim
Testing:
* Added unit tests to expr-test.cc
* Manual testing over 1400 word pairs from
http://marvin.cs.uidaho.edu/misspell.html
Results match Apache Commons.
Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Reviewed-on: http://gerrit.cloudera.org:8080/13870
Reviewed-by: Zoltan Borok-Nagy <bo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
> Add Jaro-winkler edit distance and similarity built-in function
> ---------------------------------------------------------------
>
> Key: IMPALA-8752
> URL: https://issues.apache.org/jira/browse/IMPALA-8752
> Project: IMPALA
> Issue Type: New Feature
> Reporter: Norbert Luksa
> Assignee: Norbert Luksa
> Priority: Major
> Labels: built-in-function
>
> References:
> * Apache Commons - JaroWinklerDistance: https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/JaroWinklerDistance.html
> * Apache Commons - JaroWinklerSimilarity: https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/JaroWinklerSimilarity.html
> * Oracle - JARO_WINKLER[_SIMILARITY]: https://oracle-base.com/articles/11g/utl_match-string-matching-in-oracle
> Notable differences:
> * With similarity, the Oracle version returns a normalized result ranging from 0 to 100.
> * In the Apache version, null values result in exceptions.
> * Apache rounds the values to two digits.
> The scaling factor of the algorithm can be added as an extra/default argument.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)