You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by GitBox <gi...@apache.org> on 2021/07/21 12:44:41 UTC

[GitHub] [commons-text] kinow commented on a change in pull request #233: A More Efficient Implementation for Calculating Size of Longest Common Subsequence

kinow commented on a change in pull request #233:
URL: https://github.com/apache/commons-text/pull/233#discussion_r673904211



##########
File path: src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##########
@@ -58,7 +58,57 @@ public Integer apply(final CharSequence left, final CharSequence right) {
         if (left == null || right == null) {
             throw new IllegalArgumentException("Inputs must not be null");
         }
-        return longestCommonSubsequence(left, right).length();
+        // Find lengths of two strings
+        final int leftSz = left.length();
+        final int rightSz = right.length();
+
+        // Check if we can save even more space
+        if (leftSz < rightSz) {
+            return algorithmB(right, rightSz, left, leftSz)[leftSz];
+        }
+        return algorithmB(left, leftSz, right, rightSz)[rightSz];
+    }
+
+    /**
+     * An implementation of "ALG B" from Hirschberg's paper <a href="https://dl.acm.org/doi/10.1145/360825.360861">A linear space algorithm for computing maximal common subsequences</a>.
+     * Assuming the sequence <code>left</code> is of size <code>m</code> and the sequence <code>right</code> is of size <code>n</code>,
+     * this method returns the last row of the dynamic programming table when calculating LCS the two sequences.
+     * Therefore, the last element of the returned array, is the size of LCS of <code>left</code> and <code>right</code>.
+     * This method runs in O(m * n) time and O(n) space.
+     * To save more space, it is preferable to pass the shorter sequence as <code>right</code>.
+     *
+     * @param left Left sequence
+     * @param m Length of left sequence
+     * @param right Right sequence
+     * @param n Length of right sequence
+     * @return Last row of DP table for calculating LCS of <code>left</code> and <code>right</code>
+     */
+    static int[] algorithmB(final CharSequence left, final int m,

Review comment:
       Reading the paper, now I understand the reason for using this name, good idea as it helps others to quickly locate themselves in the code & paper. Thanks! Note to self, our old `longestCommonSubstringLengthArray` is algorithm A in the paper linked.

##########
File path: src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##########
@@ -58,7 +58,48 @@ public Integer apply(final CharSequence left, final CharSequence right) {
         if (left == null || right == null) {
             throw new IllegalArgumentException("Inputs must not be null");
         }
-        return longestCommonSubsequence(left, right).length();
+        // Find lengths of two strings
+        final int leftSz = left.length();
+        final int rightSz = right.length();
+
+        // Check if we can save even more space
+        if (leftSz < rightSz) {
+            return calculateLCSLength(right, rightSz, left, leftSz);
+        }
+        return calculateLCSLength(left, leftSz, right, rightSz);
+    }
+
+    // Assuming the sequence "left" is of size m and the sequence "right" is of size n,
+    // this method calculates the length of LCS the two sequences in O(m*n) time and O(n) space.
+    // Since LCS(S1, S2) = LCS(S2, S1), it is preferable to pass shorter sequence as "right,"
+    // so that we can save even more space.
+    // This is an implementation of Hirschberg's algorithm:
+    // D. S. Hirschberg, "A Linear Space Algorithm for Computing Maximal Common Subsequences," Communications of the ACM, vol. 18, no. 6, pp. 341--343, 1975
+    static int calculateLCSLength(final CharSequence left, final int m,
+                                  final CharSequence right, final int n) {
+        // We only need to keep track of last two rows
+        final int[][] dp = new int[2][1 + n];

Review comment:
       @ali-ghanbari I think the paper uses `K`, any reason why `dp` instead of `k` (I guess `K` would upset Java linters)?

##########
File path: src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##########
@@ -58,7 +58,57 @@ public Integer apply(final CharSequence left, final CharSequence right) {
         if (left == null || right == null) {
             throw new IllegalArgumentException("Inputs must not be null");
         }
-        return longestCommonSubsequence(left, right).length();
+        // Find lengths of two strings
+        final int leftSz = left.length();
+        final int rightSz = right.length();
+
+        // Check if we can save even more space
+        if (leftSz < rightSz) {
+            return algorithmB(right, rightSz, left, leftSz)[leftSz];
+        }
+        return algorithmB(left, leftSz, right, rightSz)[rightSz];
+    }
+
+    /**
+     * An implementation of "ALG B" from Hirschberg's paper <a href="https://dl.acm.org/doi/10.1145/360825.360861">A linear space algorithm for computing maximal common subsequences</a>.
+     * Assuming the sequence <code>left</code> is of size <code>m</code> and the sequence <code>right</code> is of size <code>n</code>,
+     * this method returns the last row of the dynamic programming table when calculating LCS the two sequences.
+     * Therefore, the last element of the returned array, is the size of LCS of <code>left</code> and <code>right</code>.
+     * This method runs in O(m * n) time and O(n) space.
+     * To save more space, it is preferable to pass the shorter sequence as <code>right</code>.
+     *
+     * @param left Left sequence
+     * @param m Length of left sequence
+     * @param right Right sequence
+     * @param n Length of right sequence
+     * @return Last row of DP table for calculating LCS of <code>left</code> and <code>right</code>
+     */
+    static int[] algorithmB(final CharSequence left, final int m,
+                            final CharSequence right, final int n) {
+        final int[][] dp = new int[2][1 + n];
+
+        for (int i = 1; i <= m; i++) {
+            // K(0, j) <- K(1, j) [j = 0...n], as per the paper:
+            // Since we have references in Java, we can swap references instead of literal copying.
+            // We could also use a "binary index" using modulus operator, but directly swapping the
+            // two rows helps readability and keeps the code consistent with the algorithm description

Review comment:
       >helps readability and keeps the code consistent with the algorithm
   
   :+1: thanks!

##########
File path: src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##########
@@ -58,7 +58,57 @@ public Integer apply(final CharSequence left, final CharSequence right) {
         if (left == null || right == null) {
             throw new IllegalArgumentException("Inputs must not be null");
         }
-        return longestCommonSubsequence(left, right).length();
+        // Find lengths of two strings
+        final int leftSz = left.length();
+        final int rightSz = right.length();
+
+        // Check if we can save even more space
+        if (leftSz < rightSz) {
+            return algorithmB(right, rightSz, left, leftSz)[leftSz];
+        }
+        return algorithmB(left, leftSz, right, rightSz)[rightSz];
+    }
+
+    /**
+     * An implementation of "ALG B" from Hirschberg's paper <a href="https://dl.acm.org/doi/10.1145/360825.360861">A linear space algorithm for computing maximal common subsequences</a>.

Review comment:
       This reference is fine here, but I think it's worth either moving or duplicating in the main comment for this class, please. There's another reference there already.

##########
File path: src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##########
@@ -58,7 +58,57 @@ public Integer apply(final CharSequence left, final CharSequence right) {
         if (left == null || right == null) {
             throw new IllegalArgumentException("Inputs must not be null");
         }
-        return longestCommonSubsequence(left, right).length();
+        // Find lengths of two strings
+        final int leftSz = left.length();
+        final int rightSz = right.length();
+
+        // Check if we can save even more space
+        if (leftSz < rightSz) {
+            return algorithmB(right, rightSz, left, leftSz)[leftSz];
+        }
+        return algorithmB(left, leftSz, right, rightSz)[rightSz];
+    }
+
+    /**
+     * An implementation of "ALG B" from Hirschberg's paper <a href="https://dl.acm.org/doi/10.1145/360825.360861">A linear space algorithm for computing maximal common subsequences</a>.
+     * Assuming the sequence <code>left</code> is of size <code>m</code> and the sequence <code>right</code> is of size <code>n</code>,
+     * this method returns the last row of the dynamic programming table when calculating LCS the two sequences.
+     * Therefore, the last element of the returned array, is the size of LCS of <code>left</code> and <code>right</code>.
+     * This method runs in O(m * n) time and O(n) space.
+     * To save more space, it is preferable to pass the shorter sequence as <code>right</code>.

Review comment:
       > To save more space, it is preferable to pass the shorter sequence as <code>right</code>.
   
   Worth having a pre-step that checks the length and swaps left/right if needed? Quite sure we have this happening somewhere else in Commons Text already.

##########
File path: src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##########
@@ -120,24 +170,72 @@ public CharSequence longestCommonSubsequence(final CharSequence left, final Char
        if (left == null || right == null) {
            throw new IllegalArgumentException("Inputs must not be null");
        }
-       final StringBuilder longestCommonSubstringArray = new StringBuilder(Math.max(left.length(), right.length()));
-       final int[][] lcsLengthArray = longestCommonSubstringLengthArray(left, right);

Review comment:
       `longestCommonSubstringLengthArray` is not used anymore. We cannot remove it due to binary backward compatibility. We need to mark that method as deprecated. There's another method deprecated in this class, `logestCommonSubsequence`, due to a typo. You can use that one as reference. Just use the annotation and add a small comment saying why it was deprecated, and that it will be removed in 2.0 (helps us remember to do so when the time comes).

##########
File path: src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##########
@@ -58,7 +58,57 @@ public Integer apply(final CharSequence left, final CharSequence right) {
         if (left == null || right == null) {
             throw new IllegalArgumentException("Inputs must not be null");
         }
-        return longestCommonSubsequence(left, right).length();
+        // Find lengths of two strings
+        final int leftSz = left.length();
+        final int rightSz = right.length();
+
+        // Check if we can save even more space
+        if (leftSz < rightSz) {
+            return algorithmB(right, rightSz, left, leftSz)[leftSz];
+        }
+        return algorithmB(left, leftSz, right, rightSz)[rightSz];
+    }
+
+    /**
+     * An implementation of "ALG B" from Hirschberg's paper <a href="https://dl.acm.org/doi/10.1145/360825.360861">A linear space algorithm for computing maximal common subsequences</a>.
+     * Assuming the sequence <code>left</code> is of size <code>m</code> and the sequence <code>right</code> is of size <code>n</code>,
+     * this method returns the last row of the dynamic programming table when calculating LCS the two sequences.

Review comment:
       s/calculating LCS the two sequences/calculating the LCS of the two sequences

##########
File path: src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##########
@@ -120,24 +170,72 @@ public CharSequence longestCommonSubsequence(final CharSequence left, final Char
        if (left == null || right == null) {
            throw new IllegalArgumentException("Inputs must not be null");
        }
-       final StringBuilder longestCommonSubstringArray = new StringBuilder(Math.max(left.length(), right.length()));
-       final int[][] lcsLengthArray = longestCommonSubstringLengthArray(left, right);
-       int i = left.length() - 1;
-       int j = right.length() - 1;
-       int k = lcsLengthArray[left.length()][right.length()] - 1;
-       while (k >= 0) {
-           if (left.charAt(i) == right.charAt(j)) {
-               longestCommonSubstringArray.append(left.charAt(i));
-               i = i - 1;
-               j = j - 1;
-               k = k - 1;
-           } else if (lcsLengthArray[i + 1][j] < lcsLengthArray[i][j + 1]) {
-               i = i - 1;
-           } else {
-               j = j - 1;
-           }
+       // Find lengths of two strings
+       final int leftSz = left.length();
+       final int rightSz = right.length();
+
+       // Check if we can save even more space
+       if (leftSz < rightSz) {
+           return algorithmC(right, rightSz, left, leftSz);
        }
-       return longestCommonSubstringArray.reverse().toString();
+       return algorithmC(left, leftSz, right, rightSz);
+   }
+
+    /**
+     * An implementation of "ALG C" from Hirschberg's paper <a href="https://dl.acm.org/doi/10.1145/360825.360861">A linear space algorithm for computing maximal common subsequences</a>.
+     * Assuming the sequence <code>left</code> is of size <code>m</code> and the sequence <code>right</code> is of size <code>n</code>,
+     * this method returns Longest Common Subsequence (LCS) the two sequences.
+     * As per the paper, this method runs in O(m * n) time and O(m + n) space.
+     *
+     * @param left Left sequence
+     * @param m Length of left sequence
+     * @param right Right sequence
+     * @param n Length of right sequence
+     * @return LCS of <code>left</code> and <code>right</code>
+     */
+   static CharSequence algorithmC(final CharSequence left, final int m,

Review comment:
       @ali-ghanbari if I understood correctly, Algorithm A and Algorithm B are very similar, but I think A requires `O(mn)` time and `O(mn)` space. While Algorithm B requires `O(mn)` time and `O(m+n)` space.
   
   And, Algorithm C is the third one in that paper, with `O(mn)` time as Algorithms A and B, and `O(m+n)` space. With the difference between B and C being that B uses temporary memory linear to `m` and `n`. Whereas Algorithm C uses a constant amount of memory.
   
   (Past midnight here, long time with no :coffee: so forgive me if I missed something, or said something wrong above :+1: )
   
   The `LongestCommonSubsequence` is a function supposed to be called as a `BiFunction`, with `apply()` being the main method, and entry point to the algorithm.
   
   With the changes here, I think now `apply` uses Algorithm B. While this method with Algorithm C is called only from the public `longestCommonSubsequence`.
   
   Is there a reason for having both implementations here, instead of just Algorithm C and calling it from the `apply` method?

##########
File path: src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##########
@@ -58,7 +58,48 @@ public Integer apply(final CharSequence left, final CharSequence right) {
         if (left == null || right == null) {
             throw new IllegalArgumentException("Inputs must not be null");
         }
-        return longestCommonSubsequence(left, right).length();
+        // Find lengths of two strings
+        final int leftSz = left.length();
+        final int rightSz = right.length();
+
+        // Check if we can save even more space
+        if (leftSz < rightSz) {
+            return calculateLCSLength(right, rightSz, left, leftSz);
+        }
+        return calculateLCSLength(left, leftSz, right, rightSz);
+    }
+
+    // Assuming the sequence "left" is of size m and the sequence "right" is of size n,
+    // this method calculates the length of LCS the two sequences in O(m*n) time and O(n) space.
+    // Since LCS(S1, S2) = LCS(S2, S1), it is preferable to pass shorter sequence as "right,"
+    // so that we can save even more space.
+    // This is an implementation of Hirschberg's algorithm:
+    // D. S. Hirschberg, "A Linear Space Algorithm for Computing Maximal Common Subsequences," Communications of the ACM, vol. 18, no. 6, pp. 341--343, 1975
+    static int calculateLCSLength(final CharSequence left, final int m,
+                                  final CharSequence right, final int n) {
+        // We only need to keep track of last two rows
+        final int[][] dp = new int[2][1 + n];

Review comment:
       >As for this comment, how about returning a pre-allocated singleton array with a 0 in it?
   
   Sounds good to me :+1: worth adding a comment to the array so we don't accidentally make it public.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org