You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@asterixdb.apache.org by "Taewoo Kim (JIRA)" <ji...@apache.org> on 2017/02/01 06:10:51 UTC

[jira] [Updated] (ASTERIXDB-1778) Calculating the edit-distance in SimilarityMetricEditDistance class can be improved.

     [ https://issues.apache.org/jira/browse/ASTERIXDB-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Taewoo Kim updated ASTERIXDB-1778:
----------------------------------
    Description: 
Local optimization of the current edit-distance handling code in AsterixDB: Can we terminate the edit distance calculation based on a given edit-distance threshold T? I think the answer is yes.

Basic Algorithm:
For two strings X and Y, 

Init:
Construct D(len(X)+1=M, len(Y)+1=N).
D(i,0) = i
D(j,0) = j

Iteration:
{code}
For each i = 1 to M
    For each j = 1 to N
        D(i,j) = min( D(i-1,j) + 1 // deletion case
                      D(i,j-1) + 1 // insertion case
                      D(i-1,j-1) + cost (cost is 2 if X(i) != Y(j), 
                                                 cost is 0 if X(i) = Y(j))
                    )
{code}

Result:                    
D(M,N) is the edit distance value.

Example:


Early Termination condition:
So, for the given threshold T, we can stop the computation early if the values of three cases are greater than T. That is,

min (D(i-1,j), D(i,j-1), D(i-1,j-1)) > T

This holds since if the cost of all possible cases (insertion, deletion, and substitution) is greater than T, all future operations will be greater than T in any cases.

  was:
Local optimization of the current edit-distance handling code in AsterixDB: Can we terminate the edit distance calculation based on a given edit-distance threshold T? I think the answer is yes.

Basic Algorithm:
For two strings X and Y, 

Init:
Construct D(len(X)+1=M, len(Y)+1=N).
D(i,0) = i
D(j,0) = j

Iteration:
For each i = 1 to M
    For each j = 1 to N
        D(i,j) = min( D(i-1,j) + 1 // deletion case
                      D(i,j-1) + 1 // insertion case
                      D(i-1,j-1) + cost (cost is 2 if X(i) != Y(j), 
                                                 cost is 0 if X(i) = Y(j))
                    )

Result:                    
D(M,N) is the edit distance value.

Example:


Early Termination condition:
So, for the given threshold T, we can stop the computation early if the values of three cases are greater than T. That is,

min (D(i-1,j), D(i,j-1), D(i-1,j-1)) > T

This holds since if the cost of all possible cases (insertion, deletion, and substitution) is greater than T, all future operations will be greater than T in any cases.


> Calculating the edit-distance in SimilarityMetricEditDistance class can be improved.
> ------------------------------------------------------------------------------------
>
>                 Key: ASTERIXDB-1778
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1778
>             Project: Apache AsterixDB
>          Issue Type: Improvement
>            Reporter: Taewoo Kim
>            Priority: Minor
>
> Local optimization of the current edit-distance handling code in AsterixDB: Can we terminate the edit distance calculation based on a given edit-distance threshold T? I think the answer is yes.
> Basic Algorithm:
> For two strings X and Y, 
> Init:
> Construct D(len(X)+1=M, len(Y)+1=N).
> D(i,0) = i
> D(j,0) = j
> Iteration:
> {code}
> For each i = 1 to M
>     For each j = 1 to N
>         D(i,j) = min( D(i-1,j) + 1 // deletion case
>                       D(i,j-1) + 1 // insertion case
>                       D(i-1,j-1) + cost (cost is 2 if X(i) != Y(j), 
>                                                  cost is 0 if X(i) = Y(j))
>                     )
> {code}
> Result:                    
> D(M,N) is the edit distance value.
> Example:
> Early Termination condition:
> So, for the given threshold T, we can stop the computation early if the values of three cases are greater than T. That is,
> min (D(i-1,j), D(i,j-1), D(i-1,j-1)) > T
> This holds since if the cost of all possible cases (insertion, deletion, and substitution) is greater than T, all future operations will be greater than T in any cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)