You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@daffodil.apache.org by GitBox <gi...@apache.org> on 2022/01/11 17:27:52 UTC

[GitHub] [daffodil] stevedlawrence opened a new pull request #724: Improve performance of XMLUtils.coalesceAllAdjacentTextNodes

stevedlawrence opened a new pull request #724:
URL: https://github.com/apache/daffodil/pull/724


   Accessing the length of a Seq is linear with the size of the Seq. And
   accessing an index of a Seq is also linear with the index to get. The
   coalesceAdjacentTextNodes function has a loop that iterates over the
   children and gets both the length and index of the children Seq for
   every iteration, making this polynomial complexity. For very large
   infosets with many many children, the complexity explodes and takes an
   incredibly long time, in one case up to 8 hours.
   
   This function is called by the XMLUtils.normalize() function, which is
   used for comparing TDML infosets and building the NullInfosetOuputter,
   so for large infosets, these common functions can be very slow.
   
   This changes the loop to use a foreach instead, which avoids the length
   and indexing calculation, making this function linear and allowing TDML
   test compaison and NullInfosetOutputters to complete in a reasonable
   amount of time for large infosets (e.g. seconds instead of hours for
   large infosets).
   
   DAFFODIL-2619


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] mbeckerle commented on a change in pull request #724: Improve performance of XMLUtils.coalesceAllAdjacentTextNodes

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on a change in pull request #724:
URL: https://github.com/apache/daffodil/pull/724#discussion_r783155312



##########
File path: daffodil-lib/src/main/scala/org/apache/daffodil/xml/XMLUtils.scala
##########
@@ -360,9 +360,7 @@ object XMLUtils {
         tn = null
       }
     }
-    while (i < seq.length) {

Review comment:
       A bit more history of this. In Lisp, short lists get used as tuples. As a typeless language, there was no need for real tuples though they did get added eventually when people started worrying about performance.
   
   Since short lists were a common use case, treating them like little arrays didn't hurt much, and generic operations on lists vs. arrays seemed sensible.  
   
   And the rest is the unfortunate history of this mistake. 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence commented on a change in pull request #724: Improve performance of XMLUtils.coalesceAllAdjacentTextNodes

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on a change in pull request #724:
URL: https://github.com/apache/daffodil/pull/724#discussion_r783061027



##########
File path: daffodil-lib/src/main/scala/org/apache/daffodil/xml/XMLUtils.scala
##########
@@ -360,9 +360,7 @@ object XMLUtils {
         tn = null
       }
     }
-    while (i < seq.length) {

Review comment:
       In this case, `child` is actually just a `Seq`. I think directly accessing the children like this is the only thing in Scala XML API that returns a `Seq`, everything else is a `NodeSeq`.
   
   And note that it's actually even worse, because we don't actually know the performance characteristic of the child `Seq`. A `Seq` is just a generic trait that requires implementations of functions like `apply`, `length`, `head`, `tail`, etc, but makes no guarantees about the performance of those functions.
   
   Many of the Scala collections implement `Seq` and have very different performance complexities:
   
   https://docs.scala-lang.org/overviews/collections/performance-characteristics.html
   
   So for example, if you want to append to a `Seq` it might be linear or constant depending on if your `Seq` is actually a `List` or a mutable `Queue`.
   
   So I think the moral of the story is if you are dealing with `Seq`, you really have no real guarantees about the performance complexity unless you're confident about the underlying type, which could even change if you're getting if from some library like we do in Scala XML. In most cases, that underlying type is probably a `List` since that's what you get by default if you just create a new `Seq`, but care must be taken, especially in loops or critical sections.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] mbeckerle commented on a change in pull request #724: Improve performance of XMLUtils.coalesceAllAdjacentTextNodes

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on a change in pull request #724:
URL: https://github.com/apache/daffodil/pull/724#discussion_r782527095



##########
File path: daffodil-lib/src/main/scala/org/apache/daffodil/xml/XMLUtils.scala
##########
@@ -360,9 +360,7 @@ object XMLUtils {
         tn = null
       }
     }
-    while (i < seq.length) {

Review comment:
       Always bugs the crap out of me when people use the word "sequence" instead of "list", but the implementation doesn't have O(1) indexing and length. 
   If it's a linked list, call it a LIST !
   In this case; however, it's a sequence of XML nodes right? I.e., a scala.xml.NodeSeq ? 
   
   Suggest adding a comment to the code just to remind people that scala.xml node sequences are LISTS with linear time access to an indexed location or to determine length, so map, or foreach are preferred for traversing them. 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence merged pull request #724: Improve performance of XMLUtils.coalesceAllAdjacentTextNodes

Posted by GitBox <gi...@apache.org>.
stevedlawrence merged pull request #724:
URL: https://github.com/apache/daffodil/pull/724


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] mbeckerle commented on a change in pull request #724: Improve performance of XMLUtils.coalesceAllAdjacentTextNodes

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on a change in pull request #724:
URL: https://github.com/apache/daffodil/pull/724#discussion_r783150446



##########
File path: daffodil-lib/src/main/scala/org/apache/daffodil/xml/XMLUtils.scala
##########
@@ -360,9 +360,7 @@ object XMLUtils {
         tn = null
       }
     }
-    while (i < seq.length) {

Review comment:
       When generic sequences were introduced in Common Lisp back in the 1980s, I told people I thought they were a terrible idea, because of exactly the problem we have here. You can't write generic code using them other than map/foreach without gambling about the performance. 
   
   To me generic sequences are pointless unless you make them ONLY support map and foreach (and maybe .head, but not .tail)
   
   Note that it is not just "assuming an array, but got a list" that can cause quadratic (or worse) behavior. A list algorithm, that walks down the generic sequence calling ".head" and ".tail" is linear for a list, and quadratic for an array. 
   
   So to me these things are anathema. They violate an important design principle - a method should not be an overload of the same method name, if the implementation is algorithmically a different complexity. You should name it something else if that is the case even if the method signature (number and type of args and result) is the same. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org