You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@daffodil.apache.org by sl...@apache.org on 2022/01/12 14:26:27 UTC

[daffodil] branch main updated: Improve performance of XMLUtils.coalesceAllAdjacentTextNodes

This is an automated email from the ASF dual-hosted git repository.

slawrence pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/daffodil.git


The following commit(s) were added to refs/heads/main by this push:
     new 59c8ef8  Improve performance of XMLUtils.coalesceAllAdjacentTextNodes
59c8ef8 is described below

commit 59c8ef820da439e0cd7f7831652ac509ec8af0ad
Author: Steve Lawrence <sl...@apache.org>
AuthorDate: Wed Jan 12 08:27:45 2022 -0500

    Improve performance of XMLUtils.coalesceAllAdjacentTextNodes
    
    Accessing the length of a Seq is linear with the size of the Seq. And
    accessing an index of a Seq is also linear with the index to get. The
    coalesceAdjacentTextNodes function has a loop that iterates over the
    children and gets both the length and index of the children Seq for
    every iteration, making this polynomial complexity. For very large
    infosets with many many children, the complexity explodes and takes an
    incredibly long time, in one case up to 8 hours.
    
    This function is called by the XMLUtils.normalize() function, which is
    used for comparing TDML infosets and building the NullInfosetOuputter,
    so for large infosets, these common functions can be very slow.
    
    This changes the loop to use a foreach instead, which avoids the length
    and indexing calculation, making this function linear and allowing TDML
    test compaison and NullInfosetOutputters to complete in a reasonable
    amount of time for large infosets (e.g. seconds instead of hours for
    large infosets).
    
    DAFFODIL-2619
---
 .../main/scala/org/apache/daffodil/xml/XMLUtils.scala    | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/daffodil-lib/src/main/scala/org/apache/daffodil/xml/XMLUtils.scala b/daffodil-lib/src/main/scala/org/apache/daffodil/xml/XMLUtils.scala
index 7015f80..57ba052 100644
--- a/daffodil-lib/src/main/scala/org/apache/daffodil/xml/XMLUtils.scala
+++ b/daffodil-lib/src/main/scala/org/apache/daffodil/xml/XMLUtils.scala
@@ -360,9 +360,19 @@ object XMLUtils {
         tn = null
       }
     }
-    while (i < seq.length) {
-      val current = seq(i)
-      i = i + 1
+
+    // Note that there are no performance guarantees about the different
+    // functions of a Seq. Accessing length, updating an index, etc. may have
+    // constant, linear, or worse time-complexity depending on the underlying
+    // Seq implementation (usually a List, but not guaranteed) and could lead
+    // to algorithms that scale very poorly. If a collection function exists to
+    // do what we need (e.g. map, foreach), we should be sure to use that,
+    // which should hopefully have the best possible performance characteristics.
+    // If performance is crucial, we may even want to avoid Seq entirely and be
+    // explicit about the collection type needed to get the required
+    // performance characteristics. In this case, all we need to do is iterate
+    // over each NOde in the Seq, so foreach is appropriate.
+    seq.foreach { current =>
       if ((current.isInstanceOf[Text] || current.isInstanceOf[Unparsed])) {
         if (tn == null) {
           if (sb == null || sb.length == 0) {