Posted to dev@hive.apache.org by "Alexander Pivovarov (JIRA)" <ji...@apache.org> on 2015/02/03 07:15:35 UTC
[jira] [Updated] (HIVE-9559) Create UDF to measure strings similarity using q-gram distance algo
[ https://issues.apache.org/jira/browse/HIVE-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Pivovarov updated HIVE-9559:
--------------------------------------
Description:
Algorithm description: http://stackoverflow.com/questions/1938678/q-gram-approximate-matching-optimisations
{code}
str_sim_qgrams("Test String1", "Test String2") = 0.78571427f
{code}
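A rough sketch of how the value above can be obtained, written in Python for brevity rather than as the Hive UDF itself. It assumes trigrams (q=3) over the input padded with q-1 '#' characters on each side, and similarity = (total q-grams - block distance) / total q-grams. The function names, the pad character and the padding scheme are assumptions made for illustration, not taken from the ticket. Under these assumptions the two example strings share 11 of their 14 trigrams each, giving 22/28, which agrees with the value quoted above.

```python
from collections import Counter


def padded_qgrams(s, q=3, pad="#"):
    """Multiset of q-grams of s after padding with q-1 pad chars on each side (assumed scheme)."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_similarity(a, b, q=3):
    """Similarity in [0, 1]: 1 minus the normalized block distance between q-gram profiles."""
    pa, pb = padded_qgrams(a, q), padded_qgrams(b, q)
    total = sum(pa.values()) + sum(pb.values())
    if total == 0:
        # No q-grams at all (only possible without padding, i.e. q=1 and empty inputs).
        return 1.0
    # Block (L1) distance between the two q-gram count vectors.
    diff = sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())
    return (total - diff) / total


print(qgram_similarity("Test String1", "Test String2"))
```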
Another example (R stringdist package, q=2):
{code}
> qgrams('abcde','abdcde',q=2)
   ab bc cd de dc bd
V1  1  1  1  1  0  0
V2  1  0  1  1  1  1
> stringdist('abcde', 'abdcde', method='qgram', q=2)
[1] 3
{code}
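The q=2 example above can be checked with a small sketch (Python, for illustration only; the function names are hypothetical). It builds the q-gram count vector of each string without padding and sums the absolute differences, which is how the V1/V2 rows above lead to a distance of 3 (bc, dc and bd each differ by one).

```python
from collections import Counter


def qgram_profile(s, q):
    """Count of each q-gram (substring of length q) in s, without padding."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a, b, q=2):
    """Sum of absolute differences between the q-gram counts of a and b."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


print(qgram_distance("abcde", "abdcde", q=2))
```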
Use SimMetrics as a reference implementation:
https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/QGramsDistance.java
https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/QGramsDistanceTest.java
> Create UDF to measure strings similarity using q-gram distance algo
> -------------------------------------------------------------------
>
> Key: HIVE-9559
> URL: https://issues.apache.org/jira/browse/HIVE-9559
> Project: Hive
> Issue Type: Improvement
> Components: UDF
> Reporter: Alexander Pivovarov
> Assignee: Alexander Pivovarov
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)