Posted to dev@hive.apache.org by "Alexander Pivovarov (JIRA)" <ji...@apache.org> on 2015/02/03 07:15:35 UTC

[jira] [Updated] (HIVE-9559) Create UDF to measure strings similarity using q-gram distance algo

     [ https://issues.apache.org/jira/browse/HIVE-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Pivovarov updated HIVE-9559:
--------------------------------------
    Description: 
Algorithm description: http://stackoverflow.com/questions/1938678/q-gram-approximate-matching-optimisations

{code}
str_sim_qgrams("Test String1", "Test String2") = 0.78571427f
{code}
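
A rough sketch of how the value above can be reproduced, not the proposed UDF itself. It assumes trigrams (q=3) taken over the input padded with two '#' characters on each side, and similarity = (total - diff) / total, where total is the number of trigrams in both strings and diff is the q-gram distance. The padding character and the helper name padded_trigrams are assumptions made for illustration; under them both strings yield 14 trigrams, 3 of which differ on each side, giving (28 - 6) / 28.

```python
from collections import Counter

PAD = "#"  # assumed padding character, chosen for illustration


def padded_trigrams(s):
    """Count the trigrams of s after padding it with two PAD chars per side."""
    padded = PAD * 2 + s + PAD * 2
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def str_sim_qgrams(a, b):
    """Similarity in [0, 1]: 1.0 means the trigram profiles are identical."""
    pa, pb = padded_trigrams(a), padded_trigrams(b)
    total = sum(pa.values()) + sum(pb.values())
    # q-gram distance: absolute count difference summed over all trigrams
    diff = sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())
    return (total - diff) / total


print(str_sim_qgrams("Test String1", "Test String2"))
```

In double precision this is 22/28; the 0.78571427f above is the same ratio stored as a single-precision float.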

Another example, using the R stringdist package:
{code}
> qgrams('abcde','abdcde',q=2)
   ab bc cd de dc bd
V1  1  1  1  1  0  0
V2  1  0  1  1  1  1
 
> stringdist('abcde', 'abdcde', method='qgram', q=2)
[1] 3
{code}
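
The table above can be sketched in a few lines: build the q-gram count profile of each string, then sum the absolute differences of the counts. This is a minimal illustration with no padding, and the function names are my own rather than anything from Hive or SimMetrics.

```python
from collections import Counter


def qgrams(s, q=2):
    """Count every overlapping substring of length q in s (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a, b, q=2):
    """Sum of absolute count differences over q-grams seen in either string."""
    pa, pb = qgrams(a, q), qgrams(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


# 'bc' occurs only in the first string, 'dc' and 'bd' only in the second
print(qgram_distance("abcde", "abdcde", q=2))  # 3
```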

Use SimMetrics as the reference implementation:
https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/QGramsDistance.java
https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/QGramsDistanceTest.java

> Create UDF to measure strings similarity using q-gram distance algo
> -------------------------------------------------------------------
>
>                 Key: HIVE-9559
>                 URL: https://issues.apache.org/jira/browse/HIVE-9559
>             Project: Hive
>          Issue Type: Improvement
>          Components: UDF
>            Reporter: Alexander Pivovarov
>            Assignee: Alexander Pivovarov


