You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@daffodil.apache.org by GitBox <gi...@apache.org> on 2022/01/05 16:54:19 UTC

[GitHub] [daffodil] stevedlawrence opened a new pull request #721: Support --infoset null for CLI unparsing

stevedlawrence opened a new pull request #721:
URL: https://github.com/apache/daffodil/pull/721


   A "null" infoset inputter does not normally make sense in the context of
   unparsing because there must always be an infoset to unparse--you can't
   unparse null data. But in the context of performance, a "null" infoset
   inputter could be considered one that has close to zero overhead
   (similar to the null infoset outputter for parsing), which can help to
   understand the raw speed of Daffodil unparsing operations and to compare
   different infoset inputter overheads.
   
   To support this, this allows "null" to be used for the --infoset option
   when unparsing. This triggers the use of a new NullInfosetInputter,
   which uses a pre-created array of infoset events that are exactly what
   the Daffodil unparser expects. The events are created outside of
   measured unparser code, so the only overhead that occurs inside measured
   code is indexing into the event array and returning fields from the
   indexed event, which should give good picture of raw unparse speed.
   
   DAFFODIL-2619


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence commented on a change in pull request #721: Support --infoset null for CLI unparsing

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on a change in pull request #721:
URL: https://github.com/apache/daffodil/pull/721#discussion_r780390172



##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/NullInfosetInputter.scala
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.infoset
+
+import java.io.InputStream
+
+import scala.collection.mutable.ArrayBuffer
+
+import scala.xml.Elem
+import scala.xml.Node
+import scala.xml.SAXParser
+import scala.xml.Text
+import scala.xml.XML
+
+import org.apache.daffodil.dpath.NodeInfo
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
+import org.apache.daffodil.xml.DaffodilSAXParserFactory
+import org.apache.daffodil.xml.XMLUtils
+
+object NullInfosetInputter {
+
+  case class Event(
+    eventType: InfosetInputterEventType,
+    localName: String = null,

Review comment:
       I've looked at this more to see if it's even possible, and I'm not sure it's possible without pretty big changes. An `InfosetInputter` implementation doesn't currently know anything about ERDs. And the `InfosetInputter` base class only learns about the root ERD when `initialize()` is called, which happens at the beginning of unparse. So there is no way for an InfosetInputter to know about any ERDs prior to unparsing, and so there's no way to figure out all the information needed for `nextElementERD`.
   
   If we really wanted a super minimal `InfosetInputter`, we would have to do as you suggest and create a custom `InfosetOutputter` that serializes the infoset along with all ERD information, and then a custom InfosetInputter could use that to recreate events and avoid `nextElementErd` overhead. But the feels like quite a bit of work to just exclude the overhead of `nextElementErd()`. And since `nextElementErd()` isn't specific to any `InfosetInputter` implementation, I'd rather we just add instrumentation or profile it separately and find ways to optimize it, rather than creating a special `InfosetInputter`/`Outputter` that avoids it entirely.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence commented on a change in pull request #721: Support --infoset null for CLI unparsing

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on a change in pull request #721:
URL: https://github.com/apache/daffodil/pull/721#discussion_r780390172



##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/NullInfosetInputter.scala
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.infoset
+
+import java.io.InputStream
+
+import scala.collection.mutable.ArrayBuffer
+
+import scala.xml.Elem
+import scala.xml.Node
+import scala.xml.SAXParser
+import scala.xml.Text
+import scala.xml.XML
+
+import org.apache.daffodil.dpath.NodeInfo
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
+import org.apache.daffodil.xml.DaffodilSAXParserFactory
+import org.apache.daffodil.xml.XMLUtils
+
+object NullInfosetInputter {
+
+  case class Event(
+    eventType: InfosetInputterEventType,
+    localName: String = null,

Review comment:
       I've looked at this more to see if it's even possible, and I'm not sure it's possible without pretty big changes. An `InfosetInputter` implementation doesn't currently know anything about ERDs. And the `InfosetInputter` base class only learns about the root ERD when `initialize()` is called, which happens at the beginning of unparse. So there is no way for an InfosetInputter to know about any ERDs prior to unparsing, and so there's no way to figure out all the information needed for `nextElementERD`.
   
   If we really wanted a super minimal `InfosetInputter`, we would have to do as you suggest and create a custom `InfosetOutputter` that serializes the infoset along with all ERD information, and then a custom InfosetInputter could use that to recreate events and avoid `nextElementErd` overhead. But the feels like quite a bit of work to just exclude the overhead of `nextElementErd()`. And since `nextElementErd()` isn't specific to any `InfosetInputter` implementation, I'd rather we just add instrumentation or profile it separately and find ways to optimize it, rather than creating a special `InfosetInputter`/`Outputter` that avoids it entirely.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] tuxji commented on a change in pull request #721: Support --infoset null for CLI unparsing

Posted by GitBox <gi...@apache.org>.
tuxji commented on a change in pull request #721:
URL: https://github.com/apache/daffodil/pull/721#discussion_r779053191



##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/NullInfosetInputter.scala
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.infoset
+
+import java.io.InputStream
+
+import scala.collection.mutable.ArrayBuffer
+
+import scala.xml.Elem
+import scala.xml.Node
+import scala.xml.SAXParser
+import scala.xml.Text
+import scala.xml.XML
+
+import org.apache.daffodil.dpath.NodeInfo
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
+import org.apache.daffodil.xml.DaffodilSAXParserFactory
+import org.apache.daffodil.xml.XMLUtils
+
+object NullInfosetInputter {
+
+  case class Event(
+    eventType: InfosetInputterEventType,
+    localName: String = null,
+    namespaceURI: String = null,
+    simpleText: String = null,
+    isNilled: MaybeBoolean = MaybeBoolean.Nope,
+  )
+
+  def createEvents(is: InputStream): Array[Event] = {
+    val elem = {
+      val parser: SAXParser = {
+        val f = DaffodilSAXParserFactory()
+        f.setNamespaceAware(false)
+        val p = f.newSAXParser()
+        p
+      }
+      val node = XML.withSAXParser(parser).load(is)
+      val normalized = XMLUtils.normalize(node)
+      normalized
+    }
+    val events = ArrayBuffer[Event]()
+    events += Event(StartDocument)
+    nodeToEvents(elem, events)
+    events += Event(EndDocument)
+    events.toArray
+  }
+
+  /**
+   * Recursively walk a Scala XML Node, converting each Elem to an Event
+   * and appends it to the events array
+   */
+  private def nodeToEvents(node: Node, events: ArrayBuffer[Event]): Unit = {
+    node match {
+      case elem: Elem => {
+        val isSimple = elem.child.length == 0 || elem.child(0).isInstanceOf[Text]
+        val localName = elem.label
+        val namespaceURI = elem.namespace
+        val (simpleText, isNilled) = if (isSimple) {
+          val text = XMLUtils.remapPUAToXMLIllegalCharacters(elem.text)
+          val isNilled = elem.attribute(XMLUtils.XSI_NAMESPACE, "nil").map { attrs =>
+            val str = attrs.head.toString
+            val value = str == "true" || str == "1"
+            MaybeBoolean(value)
+          }.getOrElse(MaybeBoolean.Nope)
+          (text, isNilled)
+        } else {
+          (null, MaybeBoolean.Nope)
+        }
+
+        events += Event(StartElement, localName, namespaceURI, simpleText, isNilled)
+        if (!isSimple) elem.child.foreach { nodeToEvents(_, events) }
+        events += Event(EndElement, localName, namespaceURI)
+      }
+      case _ => // ignore

Review comment:
       This line (91) isn't covered by tests, so please expand the comment to explain why we have a case _ and what kind of Scala XML object this line might ignore.

##########
File path: daffodil-cli/src/main/scala/org/apache/daffodil/Main.scala
##########
@@ -848,6 +846,9 @@ object Main {
         val tl = anyRef.asInstanceOf[ThreadLocal[org.w3c.dom.Document]]
         Left(new W3CDOMInfosetInputter(tl.get))
       }
+      case InfosetType.NULL => {
+        Left(new NullInfosetInputter(anyRef.asInstanceOf[Array[NullInfosetInputter.Event]]))

Review comment:
       Like the case above it, this case would be clearer if line 850 was more verbose like this:
   
   ```scala
           val events = anyRef.asInstanceOf[Array[NullInfosetInputter.Event]]
           Left(new NullInfosetInputter(events))
   ```
   
   On second thought, we should write a test to cover these lines - if we use only manual testing, these lines might break after a future change without us realizing it.

##########
File path: daffodil-cli/src/main/scala/org/apache/daffodil/Main.scala
##########
@@ -815,6 +806,13 @@ object Main {
           }
         }
       }
+      case InfosetType.NULL => {
+        val is = data match {
+          case Left(bytes) => new ByteArrayInputStream(bytes)
+          case Right(is) => is
+        }
+        NullInfosetInputter.createEvents(is)

Review comment:
       Lines 811, 814, and 850 aren't covered by tests (weird that line 812 is ignored) and my initial impression at reading line 814 was that it was a side-effect call.  I saw createEvents' definition later and realized it returns an Array[Event], but line 814 would be clearer if it read like this:
   
   ```scala
           val events = NullInfosetInputter.createEvents(is)
           events
   ```
   
   Still, I won't quibble about these lines not being covered by tests.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] mbeckerle commented on a change in pull request #721: Support --infoset null for CLI unparsing

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on a change in pull request #721:
URL: https://github.com/apache/daffodil/pull/721#discussion_r779024957



##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/NullInfosetInputter.scala
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.infoset
+
+import java.io.InputStream
+
+import scala.collection.mutable.ArrayBuffer
+
+import scala.xml.Elem
+import scala.xml.Node
+import scala.xml.SAXParser
+import scala.xml.Text
+import scala.xml.XML
+
+import org.apache.daffodil.dpath.NodeInfo
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
+import org.apache.daffodil.xml.DaffodilSAXParserFactory
+import org.apache.daffodil.xml.XMLUtils
+
+object NullInfosetInputter {
+
+  case class Event(
+    eventType: InfosetInputterEventType,
+    localName: String = null,
+    namespaceURI: String = null,
+    simpleText: String = null,

Review comment:
       Similarly, the simpleText conversion into the binary simple data type is overhead we would not have if we were using our own serialization form, because the ERD would tell us exactly the simple type, so we could store the binary value. 
   
   I am less worried about this overhead than the nextElementErd() overhead however. But still if we have fine-enough granularity timers, we could time the conversion of the simple value from string to our value object that is stored in the DISimple node, and then that overhead could be excluded (or counted as part of serialization overhead.)

##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/NullInfosetInputter.scala
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.infoset
+
+import java.io.InputStream
+
+import scala.collection.mutable.ArrayBuffer
+
+import scala.xml.Elem
+import scala.xml.Node
+import scala.xml.SAXParser
+import scala.xml.Text
+import scala.xml.XML
+
+import org.apache.daffodil.dpath.NodeInfo
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
+import org.apache.daffodil.xml.DaffodilSAXParserFactory
+import org.apache.daffodil.xml.XMLUtils
+
+object NullInfosetInputter {
+
+  case class Event(
+    eventType: InfosetInputterEventType,
+    localName: String = null,

Review comment:
       Can we exclude the overhead of nextElementErd() by timing it separately allowing us to include it, or exclude it from what we're counting? 
   
   The fact that the serialized XML or JSON or whatever doesn't exactly identify the ERD for us is overhead related to the serialized form. (If we invented our own serialized form we would put an unambiguous ERD identifier in the serialized form, so that there would be no nextElementErd() overhead. )




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence merged pull request #721: Support --infoset null for CLI unparsing

Posted by GitBox <gi...@apache.org>.
stevedlawrence merged pull request #721:
URL: https://github.com/apache/daffodil/pull/721


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence commented on a change in pull request #721: Support --infoset null for CLI unparsing

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on a change in pull request #721:
URL: https://github.com/apache/daffodil/pull/721#discussion_r779035698



##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/NullInfosetInputter.scala
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.infoset
+
+import java.io.InputStream
+
+import scala.collection.mutable.ArrayBuffer
+
+import scala.xml.Elem
+import scala.xml.Node
+import scala.xml.SAXParser
+import scala.xml.Text
+import scala.xml.XML
+
+import org.apache.daffodil.dpath.NodeInfo
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
+import org.apache.daffodil.xml.DaffodilSAXParserFactory
+import org.apache.daffodil.xml.XMLUtils
+
+object NullInfosetInputter {
+
+  case class Event(
+    eventType: InfosetInputterEventType,
+    localName: String = null,

Review comment:
       I guess we could invent our own serialized form that could have ERD information and avoid that `nextElementEed()` overhead, but how practical is that in the real world? No other tools would know how to use this custom serialization, so being able to inspect/transform/validate it would be difficult. We would have to write custom tools to do all that, which seems like it removes the big benefit of Daffodil--that you can use a native serialization format that your inspect/transform/validate tools already understand and expect.
   
   Also, since all other InfosetInputter implementations other than this invented one would have this `nextElementErd()` overhead, it seems incorrect to exclude it from performance numbers. They will all  have this same `nextElementErd()` overhead (though, it should be a constant overhead regardless of `InfosetInputter` I think?).
   
   My goal here was have a way to exclude variances among different InfosetInputter implementations which can be wildly different (e.g. XMLTextInfosetInputter parsing text, SAXInfosetInputter threaded craziness, etc.). To me, this gives a good estimate of the fastest Daffodil can possible unparse. To go faster requires performance optimizations (like perhaps changing how `nextElementErd()` works).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence merged pull request #721: Support --infoset null for CLI unparsing

Posted by GitBox <gi...@apache.org>.
stevedlawrence merged pull request #721:
URL: https://github.com/apache/daffodil/pull/721


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [daffodil] stevedlawrence commented on a change in pull request #721: Support --infoset null for CLI unparsing

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on a change in pull request #721:
URL: https://github.com/apache/daffodil/pull/721#discussion_r779072919



##########
File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/NullInfosetInputter.scala
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.infoset
+
+import java.io.InputStream
+
+import scala.collection.mutable.ArrayBuffer
+
+import scala.xml.Elem
+import scala.xml.Node
+import scala.xml.SAXParser
+import scala.xml.Text
+import scala.xml.XML
+
+import org.apache.daffodil.dpath.NodeInfo
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
+import org.apache.daffodil.xml.DaffodilSAXParserFactory
+import org.apache.daffodil.xml.XMLUtils
+
+object NullInfosetInputter {
+
+  case class Event(
+    eventType: InfosetInputterEventType,
+    localName: String = null,
+    namespaceURI: String = null,
+    simpleText: String = null,
+    isNilled: MaybeBoolean = MaybeBoolean.Nope,
+  )
+
+  def createEvents(is: InputStream): Array[Event] = {
+    val elem = {
+      val parser: SAXParser = {
+        val f = DaffodilSAXParserFactory()
+        f.setNamespaceAware(false)
+        val p = f.newSAXParser()
+        p
+      }
+      val node = XML.withSAXParser(parser).load(is)
+      val normalized = XMLUtils.normalize(node)
+      normalized
+    }
+    val events = ArrayBuffer[Event]()
+    events += Event(StartDocument)
+    nodeToEvents(elem, events)
+    events += Event(EndDocument)
+    events.toArray
+  }
+
+  /**
+   * Recursively walk a Scala XML Node, converting each Elem to an Event
+   * and appends it to the events array
+   */
+  private def nodeToEvents(node: Node, events: ArrayBuffer[Event]): Unit = {
+    node match {
+      case elem: Elem => {
+        val isSimple = elem.child.length == 0 || elem.child(0).isInstanceOf[Text]
+        val localName = elem.label
+        val namespaceURI = elem.namespace
+        val (simpleText, isNilled) = if (isSimple) {
+          val text = XMLUtils.remapPUAToXMLIllegalCharacters(elem.text)
+          val isNilled = elem.attribute(XMLUtils.XSI_NAMESPACE, "nil").map { attrs =>
+            val str = attrs.head.toString
+            val value = str == "true" || str == "1"
+            MaybeBoolean(value)
+          }.getOrElse(MaybeBoolean.Nope)
+          (text, isNilled)
+        } else {
+          (null, MaybeBoolean.Nope)
+        }
+
+        events += Event(StartElement, localName, namespaceURI, simpleText, isNilled)
+        if (!isSimple) elem.child.foreach { nodeToEvents(_, events) }
+        events += Event(EndElement, localName, namespaceURI)
+      }
+      case _ => // ignore

Review comment:
       I think really this function should just expect an `Elem` instead of `Node`. Due to the `XMLutils.normalize` call it's impossible for the node parameter to not be a `Elem`. Will update it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org