You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by GitBox <gi...@apache.org> on 2021/08/06 01:40:57 UTC

[GitHub] [commons-text] ali-ghanbari commented on a change in pull request #233: [TEXT-212] A More Efficient Implementation for Calculating Size of Longest Common Subsequence

ali-ghanbari commented on a change in pull request #233:
URL: https://github.com/apache/commons-text/pull/233#discussion_r683888067



##########
File path: src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##########
@@ -29,36 +29,109 @@
  * </p>
  *
  * <p>
- * This implementation is based on the Longest Commons Substring algorithm
- * from <a href="https://en.wikipedia.org/wiki/Longest_common_subsequence_problem">
- * https://en.wikipedia.org/wiki/Longest_common_subsequence_problem</a>.
+ * As of version 1.10, a more space-efficient of the algorithm is implemented. The new algorithm has linear space
+ * complexity instead of quadratic. However, time complexity is still quadratic in the size of input strings.
  * </p>
  *
- * <p>For further reading see:</p>
+ * <p>
+ * The implementation is based on Hirschberg's Longest Commons Substring algorithm (cited below).
+ * </p>
  *
- * <p>Lothaire, M. <i>Applied combinatorics on words</i>. New York: Cambridge U Press, 2005. <b>12-13</b></p>
+ * <p>For further reading see:</p>
+ * <ul>
+ * <li>
+ * Lothaire, M. <i>Applied combinatorics on words</i>. New York: Cambridge U Press, 2005. <b>12-13</b>
+ * </li>
+ * <li>
+ * D. S. Hirschberg, "A linear space algorithm for computing maximal common subsequences," CACM, 1975, pp. 341--343.
+ * </li>
+ * </ul>
  *
  * @since 1.0
  */
 public class LongestCommonSubsequence implements SimilarityScore<Integer> {
-
     /**
-     * Calculates longest common subsequence similarity score of two {@code CharSequence}'s passed as
+     * Calculates the longest common subsequence similarity score of two {@code CharSequence}'s passed as
      * input.
      *
-     * @param left first character sequence
-     * @param right second character sequence
-     * @return longestCommonSubsequenceLength
-     * @throws IllegalArgumentException
-     *             if either String input {@code null}
+     * <p>
+     * This method implements a more efficient version of LCS algorithm which has quadratic time and
+     * linear space complexity.
+     * </p>
+     *
+     * <p>
+     * This method is based on newly implemented {@link #algorithmB(CharSequence, CharSequence)}.
+     * An evaluation using JMH revealed that this method is almost two times faster than its previous version.
+     * </p>
+     *
+     * @param left First character sequence
+     * @param right Second character sequence
+     * @return Length of the longest common subsequence of <code>left</code> and <code>right</code>
+     * @throws IllegalArgumentException if either String input {@code null}
      */
     @Override
     public Integer apply(final CharSequence left, final CharSequence right) {
         // Quick return for invalid inputs
         if (left == null || right == null) {
             throw new IllegalArgumentException("Inputs must not be null");
         }
-        return longestCommonSubsequence(left, right).length();
+        // Find lengths of two strings
+        final int leftSz = left.length();
+        final int rightSz = right.length();
+
+        // Check if we can avoid calling algorithmB which involves heap space allocation
+        if (leftSz == 0 || rightSz == 0) {
+            return 0;
+        }
+
+        // Check if we can save even more space
+        if (leftSz < rightSz) {
+            return algorithmB(right, left)[leftSz];
+        }
+        return algorithmB(left, right)[rightSz];
+    }
+
+    /**
+     * An implementation of "ALG B" from Hirschberg's CACM '71 paper.
+     * Assuming the first input sequence is of size <code>m</code> and the second input sequence is of size
+     * <code>n</code>, this method returns the last row of the dynamic programming (DP) table when calculating
+     * the LCS of the two sequences in <i>O(m*n)</i> time and <i>O(n)</i> space.

Review comment:
       Hi @kinow , thank you so much for your review :)
   
   >  ... I think because you are always passing the smallest as the second sequence, then it'll be O(n)?
   
   Yes, that is because of the way we allocate the array.
   
   > Although algorithmC does similar trick, but in the javadocs we say O(m+n) space
   
   Yes, _on the heap_ it is O(n), but since the implementation is recursive, it goes O(m) deep in the stack also.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org