You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@daffodil.apache.org by GitBox <gi...@apache.org> on 2022/08/04 15:50:54 UTC

[GitHub] [daffodil] stevedlawrence opened a new pull request, #819: Support XML strings in XMLTextInfosetInputter/Outputter

stevedlawrence opened a new pull request, #819:
URL: https://github.com/apache/daffodil/pull/819

   - When outputting string elements with the stringAsXml runtime property
     set to true, instead of escaping the simple element content, we
     instead output the content to raw XML as if it were part of the
     infoset. When inputting, we read the XML and convert it back to a
     string as the simple content. When the XML is added to the infoset, a
     wrapper element called "stringAsXml" is added to ensure the default
     namespace is reset. As an example:
   
     Infoset without the stringAsXml property:
     ```xml
     <foo>&lt;xml&gt;content&lt;/xml&gt;</foo>
     ```
     Infoset with the stringAsXml property:
   
     ```xml
     <foo><stringAsXml xmlns=""><xml>content</xml></stringAsXml></foo>
     ```
     Note that there are multiple ways to read/write XML that are
     syntactically different but semantically the same, making it unlikely
     that unparsed XML will be byte-for-byte the same as the original XML.
   
     Also note that the result of the XMLTextInfosetOutputter is used for
     "full" validation. Because this changes its output, essentially
     converting a simple string type into a complex type, this will break
     full validation if stringAsXml is used. If full validation is needed,
     one must use external validation with a modified schema.
   
   - We currently ignore the return value of InfosetOutputter functions,
     and any exceptions thrown just bubble to the top and appear as an
     unexpected exception. Instead, if the InfosetOutputter returns false
     or throws an exception, we create an SDE. The logic for finalizing the
     walker is moved into the doParse function so that the SDE is caught
     and correctly added as a diagnostic. This is need to handle
     non-well-formed XML.
   
   - We cannot use normal TDML tests to test this behavior because the
     XMLTextInfosetOutputter outputs a different infoset than the other
     infoset outputters, so Scala API tests are added to verify the correct
     output.
   
   DAFFODIL-2708


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] tuxji commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
tuxji commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r944479531


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,28 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  private def doOutputter(outputterFunc: => Boolean, desc: String, context: ThrowsSDE): Unit = {
+    Try(outputterFunc) match {
+      case Success(true) => // success
+      // $COVERAGE-OFF$
+      case Success(false) => Assert.usageError("InfosetOutputter false return value is deprecated. Throw an Exception instead.")
+      // $COVERAGE-ON$
+      case Failure(e) => {
+        val cause = e.getCause
+        val msg = if (cause == null) e.toString else cause.toString
+        context.SDE("Failed to %s: %s", desc, msg)

Review Comment:
   I like both the usageError deprecating Success(false) and the cause retrieval.  👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] tuxji commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
tuxji commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r941373667


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -17,24 +17,27 @@
 
 package org.apache.daffodil.infoset
 
+import java.io.StringWriter
+import java.nio.charset.StandardCharsets
+import javax.xml.XMLConstants
+import javax.xml.stream.XMLInputFactory
+import javax.xml.stream.XMLStreamConstants._
+import javax.xml.stream.XMLStreamException
+import javax.xml.stream.XMLStreamReader
+import javax.xml.stream.XMLStreamWriter
+import javax.xml.stream.util.XMLEventAllocator
+
+import org.apache.daffodil.dpath.NodeInfo
 import org.apache.daffodil.exceptions.Assert
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
 import org.apache.daffodil.util.Misc
 import org.apache.daffodil.xml.XMLUtils
-import org.apache.daffodil.util.MaybeBoolean
-import org.apache.daffodil.dpath.NodeInfo
-import org.apache.daffodil.infoset.InfosetInputterEventType._
-
-import javax.xml.stream.XMLStreamReader
-import javax.xml.stream.XMLStreamConstants._
-import javax.xml.stream.XMLInputFactory
-import javax.xml.stream.util.XMLEventAllocator
-import javax.xml.stream.XMLStreamException
-import javax.xml.XMLConstants
 
-object XMLTextInfosetInputter {
+object XMLTextInfoset {

Review Comment:
   What's the reason for the name change?  It might be nice to put a one-line comment above here explaining the object's purpose.



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -452,7 +466,7 @@ class InfosetWalker private (
       if (child.isSimple) {
         if (!child.isHidden || walkHidden) {
           val simple = child.asInstanceOf[DISimple]
-          outputter.startSimple(simple)
+          doOutputter(outputter.startSimple(simple), "start infoset simple element", simple.erd)
           outputter.endSimple(simple)

Review Comment:
   Does `outputter.endSimple(simple)` need a `doOutputter` wrapper too?
   
   All in all, it seems like a pain to have to wrap all the infosetinputter calls.  It would seem simpler to let InfosetInputter throw an exception and let the exception propagate outside of this method without having to catch it and handle it.



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline
+  private def doOutputter(outputterFunc: => Boolean, desc: String, context: ThrowsSDE): Unit = {
+    Try(outputterFunc) match {
+      case Success(true) => // success
+      case Success(false) => context.SDE("Failed to %s", desc)
+      case Failure(e) => context.SDE("Failed to %s: %s", desc, e.toString)

Review Comment:
   Note that `e.toString` only prints the exception's name and message.  If the exception has a good chance of containing a real reason exception object itself, you should print that real reason object in the SDE message or simply throw the whole exception encapsulated by XMLTextInfosetOutputterFailure(e) to ensure the user sees the real reason for the failure.



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline
+  private def doOutputter(outputterFunc: => Boolean, desc: String, context: ThrowsSDE): Unit = {
+    Try(outputterFunc) match {
+      case Success(true) => // success
+      case Success(false) => context.SDE("Failed to %s", desc)

Review Comment:
   I'll defer to your opinions on changing the InfosetInputter API since I don't know how important backward compatibility is to Daffodil users.  However, I agree that returning a boolean false and throwing an exception are redundant ways of informing the caller of failure.  Furthermore, an exception can contain a message and a real reason exception object while a boolean false contains no additional information, leaving the user puzzled and wondering why a failure happened unless the InfosetInputter object also has the Diagnostics trait (which means it's a mutable object with all the problems of mutable state).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r941429056


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline
+  private def doOutputter(outputterFunc: => Boolean, desc: String, context: ThrowsSDE): Unit = {
+    Try(outputterFunc) match {
+      case Success(true) => // success
+      case Success(false) => context.SDE("Failed to %s", desc)

Review Comment:
   Agreed. I don't know of any custom InfosetOutputters, so probably better to break the API now before any are created. Unless there are objections, I might save that change for another pull request. This was a fairly minor change that was needed to to support failing XML payloads. Improving the API to not use booleans and predicable exception's feels like a larger task.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on PR #819:
URL: https://github.com/apache/daffodil/pull/819#issuecomment-1206448592

   > we output some element that is just the string like <invalidXML>the escapified XML string here</invalidXML> which would be the other branch of the choice
   
   Interesting idea. So an XML Schema that uses this feature would look something like this:
   
   ```xml
   <xs:element name="payload" type="xs:string" dfdlx:runtimeProperty="stringAsXml=true" ... />
   ```
   
   And the schema to validate this would look like this:
   ```xml
   <xs:element name="payload">
     <xs:complexType>
       <xs:choice>
         <xs:element name="stringAsXml">
           <xs:complexType>
             <xs:sequence>
               <xs:element ref="pre:foo" />
             </xs:sequence>
           </xs:complexType>
         </xs:element>
         <xs:element name="invalidXml" type="xs:string" maxLength="0" />
       </xs:choice>
     </xs:complexType>
   </xs:element>
   ```
   
   Where the `pre:foo` reference changes depending on what the actual XML content is.
   
   The only potential consideration I can think of is currently we write the XML directly to the OutputStream. To support the invalidXml idea, we would need to write the XML to a temporary string buffer, and only write it out to the OutputStream if the XMLStreamReader finished successfully. So more overhead depending on the size of the XML payload. I suspect it's not worth worrying about? I imagine in general people using this feature would have relatively small payloads?
   
   Question though is if it's even worth it? Is there value in generating an infoset with escaped invalid XML? Is there more value to immediately error, which would contain specific diagnostics from the XMLStreamReader about where that XML was invalid? I'm not sure of the specific uses cases and what makes the most sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on PR #819:
URL: https://github.com/apache/daffodil/pull/819#issuecomment-1205681053

   > > In your introductory description of the PR, the comment suggests that for an XML string like `<foo>...</foo>` the output XML would be `<foo><stringAsXML xmlns=''><xml>.....</xml></stringAsXML></foo>` but your code of the infoset outputter shows `<stringAsXML xmlns=''><xml><foo>.....</foo></xml></stringAsXML>`, which I think has to be correct. We can't insert elements insde the elements from the XML string, only around them.
   > 
   > Can you point out in the code? I don't _think_ it inserts anything inside the xml, but I might have a bug that I'm not seeing.
   
   The code looks right. It's just this description doesn't match the code. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938375295


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetOutputter.scala:
##########
@@ -107,15 +160,16 @@ class XMLTextInfosetOutputter private (writer: java.io.Writer, pretty: Boolean,
     outputStartTag(simple)
 
     if (!isNilled(simple) && simple.hasValue) {
-      val text =
-        if (simple.erd.optPrimType.get.isInstanceOf[NodeInfo.String.Kind]) {
-          val s = remapped(simple.dataValueAsString)
-          scala.xml.Utility.escape(s)
+      if (simple.erd.optPrimType.get == NodeInfo.String) {
+        val simpleVal = simple.dataValueAsString
+        if (simple.erd.runtimeProperties.get(XMLTextInfoset.stringAsXml) == "true") {
+          writeStringAsXml(simpleVal)
         } else {
-          simple.dataValueAsString
+          writer.write(scala.xml.Utility.escape(remapped(simpleVal)))
         }
-
-      writer.write(text)
+      } else {
+        writer.write(simple.dataValueAsString)
+      }
     }
 

Review Comment:
   I would suggest NOT adding that. False sense of security, etc. 
   
   The feature needed is something that edits and creates the modified XSD from the DFDL automatically. I would suggest this isn't Daffodil doing it but some tool. Out of scope for now. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938196120


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -17,24 +17,27 @@
 
 package org.apache.daffodil.infoset
 
+import java.io.StringWriter
+import java.nio.charset.StandardCharsets
+import javax.xml.XMLConstants
+import javax.xml.stream.XMLInputFactory
+import javax.xml.stream.XMLStreamConstants._
+import javax.xml.stream.XMLStreamException
+import javax.xml.stream.XMLStreamReader
+import javax.xml.stream.XMLStreamWriter
+import javax.xml.stream.util.XMLEventAllocator
+
+import org.apache.daffodil.dpath.NodeInfo
 import org.apache.daffodil.exceptions.Assert
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
 import org.apache.daffodil.util.Misc
 import org.apache.daffodil.xml.XMLUtils
-import org.apache.daffodil.util.MaybeBoolean
-import org.apache.daffodil.dpath.NodeInfo
-import org.apache.daffodil.infoset.InfosetInputterEventType._
-
-import javax.xml.stream.XMLStreamReader
-import javax.xml.stream.XMLStreamConstants._
-import javax.xml.stream.XMLInputFactory
-import javax.xml.stream.util.XMLEventAllocator
-import javax.xml.stream.XMLStreamException
-import javax.xml.XMLConstants
 
-object XMLTextInfosetInputter {
+object XMLTextInfoset {
   lazy val xmlInputFactory = {
     val fact = new com.ctc.wstx.stax.WstxInputFactory()
-    fact.setProperty(XMLInputFactory.IS_COALESCING, true)
+    fact.setProperty(XMLInputFactory.IS_COALESCING, false)

Review Comment:
   As I started the implementation I found that coalescing=true caused things like CDATA/whitespace/etc to get lost. So this was an early attempt to get things as close as possible to not changing on unparse. I've since determined that's very difficult for the unparse XML to be different than the original XML for many reasons (this just being one of them), so this could be reverted without problem.
   
   But for normal infoset processing, the property doesn't actually make a difference, since we use `getText()` which does coalesce the simple content regardless of this property.
   
   So in most cases this property doesn't matter, and if set to false we git a slightly more accurate XML to String conversion on unparse.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence merged pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence merged PR #819:
URL: https://github.com/apache/daffodil/pull/819


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on PR #819:
URL: https://github.com/apache/daffodil/pull/819#issuecomment-1205477620

   > In your introductory description of the PR, the comment suggests that for an XML string like `<foo>...</foo>` the output XML would be `<foo><stringAsXML xmlns=''><xml>.....</xml></stringAsXML></foo>` but your code of the infoset outputter shows `<stringAsXML xmlns=''><xml><foo>.....</foo></xml></stringAsXML>`, which I think has to be correct. We can't insert elements insde the elements from the XML string, only around them.
   
   Can you point out in the code? I don't *think* it inserts anything inside the xml, but I might have a bug that I'm not seeing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938202189


##########
daffodil-sapi/src/test/resources/test/sapi/mySchemaStringAsXml.dfdl.xsd:
##########
@@ -0,0 +1,38 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+<schema xmlns="http://www.w3.org/2001/XMLSchema"
+  targetNamespace="http://example.com"
+  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
+  xmlns:dfdlx="http://www.ogf.org/dfdl/dfdl-1.0/extensions"
+  xmlns:xs="http://www.w3.org/2001/XMLSchema"
+  xmlns:tns="http://example.com">
+
+  <include schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd"/>
+    
+  <annotation>
+    <appinfo source="http://www.ogf.org/dfdl/">
+      <dfdl:format ref="tns:GeneralFormat" />
+    </appinfo>
+  </annotation>
+  
+  <element name="file" type="xs:string"

Review Comment:
   Yeah, I'll make some tests as suggestion to make sure was test as much of this as possible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938766835


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -48,6 +51,72 @@ object XMLTextInfosetInputter {
     fact.setProperty(XMLInputFactory.IS_VALIDATING, false) // no DTD validation
     fact
   }
+
+  lazy val xmlOutputFactory = {
+    val fact = new com.ctc.wstx.stax.WstxOutputFactory()
+    fact
+  }
+
+  /**
+   * When the stringAsXml runtime property is set to true we treat the simple
+   * content as if it were XML and output it to the infoset. We use this value
+   * as the runtime property name. Additionally, when we output the XML we need
+   * a wrapper element to reset the default XML namespace in case the infoset
+   * has a default namespace. We also use this variable as the name of that
+   * wrapper element.
+   */
+  lazy val stringAsXml = "stringAsXml"
+
+  /**
+   * Read an event from an XMLStreamReader and write the same event to an
+   * XMLStreamWriter. For example, this can be used for convert XML from one
+   * type of input to a different type (e.g. InputStream to String), check for
+   * well-formed XML, and perform some basic normalization. Example usage of
+   * this function is to call this in a loop like this:
+   *
+   *   while (xsr.hasNext()) {
+   *     XMLTextInfoset.writeStreamEvent(xsr, xsw)
+   *     xsr.next()
+   *   }
+   *
+   * This approach allows callers to skip certain events by conditionally
+   * calling this function.
+   */
+  def writeXMLStreamEvent(xsr: XMLStreamReader, xsw: XMLStreamWriter): Unit = {
+    xsr.getEventType() match {
+      case START_ELEMENT => {
+        xsw.writeStartElement(
+          xsr.getPrefix(),
+          xsr.getLocalName(),
+          xsr.getNamespaceURI())
+        for (i <- 0 until xsr.getNamespaceCount()) {
+          xsw.writeNamespace(
+            xsr.getNamespacePrefix(i),
+            xsr.getNamespaceURI(i))
+        }
+        for (i <- 0 until xsr.getAttributeCount()) {
+          xsw.writeAttribute(
+            xsr.getAttributePrefix(i),
+            xsr.getAttributeNamespace(i),
+            xsr.getAttributeLocalName(i),
+            xsr.getAttributeValue(i))
+        }
+      }
+      case END_ELEMENT => xsw.writeEndElement()
+      case CHARACTERS => xsw.writeCharacters(xsr.getText())
+      case COMMENT => xsw.writeComment(xsr.getText())
+      case CDATA => xsw.writeCData(xsr.getText())
+      case PROCESSING_INSTRUCTION => xsw.writeProcessingInstruction(xsr.getPITarget(), xsr.getPIData())
+      case ENTITY_REFERENCE => xsw.writeEntityRef(xsr.getLocalName())

Review Comment:
   After a little testing and reading documentatino, it looks like coalescing turned off preserves CHARACTER, CDATA, SPACE, and ENTITY_REFERENCE events. When on, any of those contiguous events are combined into a single CHARACTER event.
   
   So in addition to ENTITY_REFERENCES, we'll also preserve CDATA and SPACE (though I've never seen a SPACE event in my testing).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on PR #819:
URL: https://github.com/apache/daffodil/pull/819#issuecomment-1206577611

   
   > Question though is if it's even worth it? Is there value in generating an infoset with escaped invalid XML? ...
   
   I think not worth it, at least for now. We need to document that it is a RSDE if the XML is not well-formed. 
   
   We can add new variants like stringAsXMLSafe="true" in the future with different error behavior such as tolerating malformed if needed. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on PR #819:
URL: https://github.com/apache/daffodil/pull/819#issuecomment-1205683613

   We also need a negative test of non-well-formed XML showing we get an error. 
   
   Also, I think non-well-formed XML should be a parse error, not a runtime SDE. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938902622


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -462,9 +476,11 @@ class InfosetWalker private (
         // must be complex or array, exact same logic for both
         if (!child.isHidden || walkHidden) {
           if (child.isComplex) {
-            outputter.startComplex(child.asInstanceOf[DIComplex])
+            val complex = child.asInstanceOf[DIComplex]
+            doOutputter(outputter.startComplex(complex), "start infoset complex eement", complex.erd)

Review Comment:
   typo "eement"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938376516


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -17,24 +17,27 @@
 
 package org.apache.daffodil.infoset
 
+import java.io.StringWriter
+import java.nio.charset.StandardCharsets
+import javax.xml.XMLConstants
+import javax.xml.stream.XMLInputFactory
+import javax.xml.stream.XMLStreamConstants._
+import javax.xml.stream.XMLStreamException
+import javax.xml.stream.XMLStreamReader
+import javax.xml.stream.XMLStreamWriter
+import javax.xml.stream.util.XMLEventAllocator
+
+import org.apache.daffodil.dpath.NodeInfo
 import org.apache.daffodil.exceptions.Assert
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
 import org.apache.daffodil.util.Misc
 import org.apache.daffodil.xml.XMLUtils
-import org.apache.daffodil.util.MaybeBoolean
-import org.apache.daffodil.dpath.NodeInfo
-import org.apache.daffodil.infoset.InfosetInputterEventType._
-
-import javax.xml.stream.XMLStreamReader
-import javax.xml.stream.XMLStreamConstants._
-import javax.xml.stream.XMLInputFactory
-import javax.xml.stream.util.XMLEventAllocator
-import javax.xml.stream.XMLStreamException
-import javax.xml.XMLConstants
 
-object XMLTextInfosetInputter {
+object XMLTextInfoset {
   lazy val xmlInputFactory = {
     val fact = new com.ctc.wstx.stax.WstxInputFactory()
-    fact.setProperty(XMLInputFactory.IS_COALESCING, true)
+    fact.setProperty(XMLInputFactory.IS_COALESCING, false)

Review Comment:
   I do recall reading some doc about coalescing someplace. I think the context was the scala.xml library loader though. Was about whether you want separate events for entities, CDATAs, and regular characters, or to pre-combine them. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r943825332


##########
daffodil-test/src/test/scala/org/apache/daffodil/infoset/TestStringAsXml.scala:
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.infoset
+
+import java.io.ByteArrayOutputStream
+import java.io.ByteArrayInputStream
+import java.io.File
+import java.io.InputStream
+import java.net.URI
+import java.nio.charset.StandardCharsets
+import javax.xml.XMLConstants
+import javax.xml.transform.stream.StreamSource
+import javax.xml.validation.SchemaFactory
+
+import org.junit.Test
+import org.junit.Assert._
+
+import org.xml.sax.SAXParseException
+
+import org.apache.commons.io.IOUtils
+
+import org.apache.daffodil.api.DFDL.DataProcessor
+import org.apache.daffodil.api.URISchemaSource
+import org.apache.daffodil.api.ValidationMode
+import org.apache.daffodil.compiler.Compiler
+import org.apache.daffodil.io.InputSourceDataInputStream
+import org.apache.daffodil.util.Misc
+import org.apache.daffodil.Implicits.intercept
+
+
+class TestStringAsXml {
+
+  private def compileSchema(dfdlSchemaURI: URI) = {
+    val c = Compiler()
+    val pf = c.compileSource(URISchemaSource(dfdlSchemaURI))
+    val dp = pf.onPath("/")
+    dp.withValidationMode(ValidationMode.Full)
+  }
+
+  private def doParse(dp: DataProcessor, data: InputStream) = {
+    val parseIn = InputSourceDataInputStream(data)
+    val parseBos = new ByteArrayOutputStream()
+    val parseOut = new XMLTextInfosetOutputter(parseBos, pretty = true)
+    val parseRes = dp.parse(parseIn, parseOut)
+    val parseDiags = parseRes.getDiagnostics.map(_.toString)
+    val parseStrOpt = if (parseRes.isProcessingError) None else Some(parseBos.toString)
+    (parseDiags, parseStrOpt)
+  }
+
+  private def doUnparse(dp: DataProcessor, infoset: InputStream) = {
+    val unparseIn = new XMLTextInfosetInputter(infoset)
+    val unparseBos = new ByteArrayOutputStream()
+    val unparseOut = java.nio.channels.Channels.newChannel(unparseBos)
+    val unparseRes = dp.unparse(unparseIn, unparseOut)
+    val unparseDiags = unparseRes.getDiagnostics.map(_.toString)
+    val unparseStrOpt = if (unparseRes.isProcessingError) None else Some(unparseBos.toString)
+    (unparseDiags, unparseStrOpt)
+  }
+
+  @Test def test_stringAsXml_01(): Unit = {
+    val dp = compileSchema(Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/xsd/binMessage.dfdl.xsd"))
+    val parseData = Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/binMessage_01.dat").toURL.openStream
+    val (parseDiags, Some(parseInfosetActual)) = doParse(dp, parseData)
+    val parseInfosetExpected = {
+      val is = Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/binMessage_01.dat.xml").toURL.openStream
+      IOUtils.toString(is, StandardCharsets.UTF_8)
+    }
+    // diagnostic from full validation
+    assertTrue(parseDiags.find(_.contains("Element 'xmlStr' is a simple type")).isDefined)
+    // diagnostic from limited validation
+    assertTrue(parseDiags.find(_.contains("xmlStr failed facet checks due to: facet maxLength")).isDefined)
+    // we still get the expected infoset, replace CRLF with LF because of git windows autocrlf
+    assertEquals(parseInfosetExpected.replace("\r\n", "\n"), parseInfosetActual.replace("\r\n", "\n"))
+
+    // validate the infoset using the handwritten WithPayload schema
+    val xsdFile = new File(Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/xsd/binMessageWithXmlPayload.xsd"))
+    val xsdFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
+    val xsd = xsdFactory.newSchema(xsdFile)
+    val source = new StreamSource(new ByteArrayInputStream(parseInfosetActual.getBytes(StandardCharsets.UTF_8)))
+    val validator = xsd.newValidator()
+    val e = intercept[SAXParseException] {
+      validator.validate(source)
+    }
+    assertTrue(e.toString.contains("Value '=invalid field' is not facet-valid"))
+  }
+
+  @Test def test_stringAsXml_02(): Unit = {
+    val dp = compileSchema(Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/xsd/binMessage.dfdl.xsd"))
+    val unparseInfoset = Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/binMessage_01.dat.xml").toURL.openStream
+    val (_, Some(unparseDataActual)) = doUnparse(dp, unparseInfoset)
+    val unparseDataExpected = {
+      val is = Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/binMessage_01.dat.xml.dat").toURL.openStream

Review Comment:
   It's my way of saying the original is `foo.dat`, the expected infoset created when parsing that file is `foo.dat.xml`, and the expected data when unparsing that is `foo.dat.xml.dat`. So `foo.dat` and `foo.dat.xml.dat` are functionally the same, but the latter has canonicalized XML in the payload.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938752704


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline

Review Comment:
   Looking at the bytecode, I've confirmed that the function is not actually being inlined. And it looks like it is creating an anonymous `Function0`. I'll remove as this clearly doesn't do anything.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on PR #819:
URL: https://github.com/apache/daffodil/pull/819#issuecomment-1205954194

   > > We also need a negative test of non-well-formed XML showing we get an error.
   > 
   > In the Scala API tests I added we do have two negative tests, one for parsing where the parsed data is not valid XML, and one for parsing where the infoset is not valid XML.
   > 
   > > Also, I think non-well-formed XML should be a parse error, not a runtime SDE.
   > 
   > This actually raises a good point. The XMLTextInfosetOutputter is the thing that converts the string to XML. But that conversion doesn't necessarily happen when the string was originally parsed. InfosetOutputeer events are only triggered after all previous points of uncertainty are resolved. So it's very possible that we're long passed where the string was parsed, and so throwing a ParseError would cause backtracking at some random point. So I think any error thrown by an InfosetOutputter has to be considered fatal and stop the parse immediately. It's impossible to correctly recover or backtrack. Maybe an SDE isn't the right error and we need something new, but I don't think it can be a ParseError. Maybe we have a new "InfosetOutputter" error or something unique to the situation?
   
   Yeah this is a harder issue than I thought. I guess RSDE is the closest thing we have for now. 
   
   I'm thinking what we would need is instead of replacing the xmlString with only a complexType element with the right sub-structure, we instead put that inside a choice in the combined schema. When we construct the XML we either create the expected XML (one branch of the choice would validate that), or if that fails (and we catch the exception in the infoset outputter) we output some element that is just the string like ```<invalidXML>the escapified XML string here</invalidXML>``` which would be the other branch of the choice, and would always be invalid as it would be required to be maxLength 0 or something so it is forcibly always invalid.  
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on PR #819:
URL: https://github.com/apache/daffodil/pull/819#issuecomment-1213271902

   > although it might be worth updating the Confluence page 
   
   Good call, I'll update that page shortly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on PR #819:
URL: https://github.com/apache/daffodil/pull/819#issuecomment-1205687573

   Question: can I use stringAsXML=true on an element with dfdl:inputValueCalc?
   
   I ask because I expect someone will say they want to use this feature, but cannot tolerate the canonicalization of the XML. 
   
   In that case I think they'd have to parse the string as an ordinary string, have an element with dfdl:inputValueCalc the value of which just takes its string value from the ordinary string, but this second element has stringAsXML=true. 
   
   That way they can validate against the combined schema XSD, but the unparse would take from the original string. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r941437622


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline
+  private def doOutputter(outputterFunc: => Boolean, desc: String, context: ThrowsSDE): Unit = {
+    Try(outputterFunc) match {
+      case Success(true) => // success
+      case Success(false) => context.SDE("Failed to %s", desc)

Review Comment:
   Perhaps the version that we expect to deprecate should just Assert.notImplemented(...)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938913866


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline
+  private def doOutputter(outputterFunc: => Boolean, desc: String, context: ThrowsSDE): Unit = {
+    Try(outputterFunc) match {
+      case Success(true) => // success
+      case Success(false) => context.SDE("Failed to %s", desc)

Review Comment:
   Can this case even happen?  Add comment about how.



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline
+  private def doOutputter(outputterFunc: => Boolean, desc: String, context: ThrowsSDE): Unit = {
+    Try(outputterFunc) match {
+      case Success(true) => // success
+      case Success(false) => context.SDE("Failed to %s", desc)
+      case Failure(e) => context.SDE("Failed to %s: %s", desc, e.toString)

Review Comment:
   Did you consider whether this should just be encapsulating the error as a XMLTextInfosetOutputterFailure(e) and throwing that to the caller inside Daffodil?
   
   I'm not sure that is better, just wondering. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938928445


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline
+  private def doOutputter(outputterFunc: => Boolean, desc: String, context: ThrowsSDE): Unit = {
+    Try(outputterFunc) match {
+      case Success(true) => // success
+      case Success(false) => context.SDE("Failed to %s", desc)

Review Comment:
   The InfosetInputter API has it so the functions either return a boolean to indicate success/fail, or throw an exception. I'm not sure if that really makes sense, but it is what it is.
   
   We could deprecate that behavior and a change the interface to be void functions and only expect no return value on success and an exception on failure. I'm sure no custom infoset inputters rely returning true/false since we never handled the case before, but it would change the InfosetInputter interface and so its not backwards compatible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on PR #819:
URL: https://github.com/apache/daffodil/pull/819#issuecomment-1205464299

   In your introductory description of the PR, the comment suggests that for an XML string like ```<foo>...</foo>``` the output XML would be ```<foo><stringAsXML xmlns=''>.....</stringAsXML></foo>``` but your code of the infoset outputter shows
   ```<stringAsXML xmlns=''><foo>.....</foo></stringAsXML>```, which has to be correct. 
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938197828


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetOutputter.scala:
##########
@@ -99,6 +102,56 @@ class XMLTextInfosetOutputter private (writer: java.io.Writer, pretty: Boolean,
     writer.write(">")
   }
 
+  private def writeStringAsXml(str: String): Unit = {
+    // create a wrapper element that allows us to reset the default XML
+    // namespace. This ensures the embedded XML does not inherit the default
+    // namespaces if one is defined in the infoset
+    incrementIndentation()
+    if (pretty) {
+      writer.write(System.lineSeparator())
+      outputIndentation(writer)
+    }
+    writer.write("<")
+    writer.write(XMLTextInfoset.stringAsXml)
+    writer.write(" xmlns=\"\">")
+

Review Comment:
   The `XMLTextInfoset.stringAsXML` variable is to create the wrapper element `<stringAsXml xmlns="">`. The `<xml>` thing doesn't actualy show up in code anywhere, that was just an example as if the data that was parsed contained the string `<xml>...</xml>`. That can be confusing so I'll change that to something else.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938206553


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -17,24 +17,27 @@
 
 package org.apache.daffodil.infoset
 
+import java.io.StringWriter
+import java.nio.charset.StandardCharsets
+import javax.xml.XMLConstants
+import javax.xml.stream.XMLInputFactory
+import javax.xml.stream.XMLStreamConstants._
+import javax.xml.stream.XMLStreamException
+import javax.xml.stream.XMLStreamReader
+import javax.xml.stream.XMLStreamWriter
+import javax.xml.stream.util.XMLEventAllocator
+
+import org.apache.daffodil.dpath.NodeInfo
 import org.apache.daffodil.exceptions.Assert
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
 import org.apache.daffodil.util.Misc
 import org.apache.daffodil.xml.XMLUtils
-import org.apache.daffodil.util.MaybeBoolean
-import org.apache.daffodil.dpath.NodeInfo
-import org.apache.daffodil.infoset.InfosetInputterEventType._
-
-import javax.xml.stream.XMLStreamReader
-import javax.xml.stream.XMLStreamConstants._
-import javax.xml.stream.XMLInputFactory
-import javax.xml.stream.util.XMLEventAllocator
-import javax.xml.stream.XMLStreamException
-import javax.xml.XMLConstants
 
-object XMLTextInfosetInputter {
+object XMLTextInfoset {
   lazy val xmlInputFactory = {
     val fact = new com.ctc.wstx.stax.WstxInputFactory()
-    fact.setProperty(XMLInputFactory.IS_COALESCING, true)
+    fact.setProperty(XMLInputFactory.IS_COALESCING, false)

Review Comment:
   I also noticed it with perserving CDATA events, there might be others. Maybe there's some good documenation about what Coalescing actually does.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938203856


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline

Review Comment:
   I was trying to avoid scala creating an anonymous function for the by-name. Maybe that is just a pre-optimization though and maybe doesn't even matter. I'm not even exactly sure how scala implements by-name. I can remove this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938939972


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline
+  private def doOutputter(outputterFunc: => Boolean, desc: String, context: ThrowsSDE): Unit = {
+    Try(outputterFunc) match {
+      case Success(true) => // success
+      case Success(false) => context.SDE("Failed to %s", desc)

Review Comment:
   Yes, that API is bothersome. They should return Unit (no value) on success, throw on failure. 
   
   They should throw a predictable InfosetInputterException or InfosetOutputterException respectively which encapsulates the real reason exception object. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r937983759


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -17,24 +17,27 @@
 
 package org.apache.daffodil.infoset
 
+import java.io.StringWriter
+import java.nio.charset.StandardCharsets
+import javax.xml.XMLConstants
+import javax.xml.stream.XMLInputFactory
+import javax.xml.stream.XMLStreamConstants._
+import javax.xml.stream.XMLStreamException
+import javax.xml.stream.XMLStreamReader
+import javax.xml.stream.XMLStreamWriter
+import javax.xml.stream.util.XMLEventAllocator
+
+import org.apache.daffodil.dpath.NodeInfo
 import org.apache.daffodil.exceptions.Assert
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
 import org.apache.daffodil.util.Misc
 import org.apache.daffodil.xml.XMLUtils
-import org.apache.daffodil.util.MaybeBoolean
-import org.apache.daffodil.dpath.NodeInfo
-import org.apache.daffodil.infoset.InfosetInputterEventType._
-
-import javax.xml.stream.XMLStreamReader
-import javax.xml.stream.XMLStreamConstants._
-import javax.xml.stream.XMLInputFactory
-import javax.xml.stream.util.XMLEventAllocator
-import javax.xml.stream.XMLStreamException
-import javax.xml.XMLConstants
 
-object XMLTextInfosetInputter {
+object XMLTextInfoset {
   lazy val xmlInputFactory = {
     val fact = new com.ctc.wstx.stax.WstxInputFactory()
-    fact.setProperty(XMLInputFactory.IS_COALESCING, true)
+    fact.setProperty(XMLInputFactory.IS_COALESCING, false)

Review Comment:
   Why? Needs comment.



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetOutputter.scala:
##########
@@ -107,15 +160,16 @@ class XMLTextInfosetOutputter private (writer: java.io.Writer, pretty: Boolean,
     outputStartTag(simple)
 
     if (!isNilled(simple) && simple.hasValue) {
-      val text =
-        if (simple.erd.optPrimType.get.isInstanceOf[NodeInfo.String.Kind]) {
-          val s = remapped(simple.dataValueAsString)
-          scala.xml.Utility.escape(s)
+      if (simple.erd.optPrimType.get == NodeInfo.String) {
+        val simpleVal = simple.dataValueAsString
+        if (simple.erd.runtimeProperties.get(XMLTextInfoset.stringAsXml) == "true") {
+          writeStringAsXml(simpleVal)
         } else {
-          simple.dataValueAsString
+          writer.write(scala.xml.Utility.escape(remapped(simpleVal)))
         }
-
-      writer.write(text)
+      } else {
+        writer.write(simple.dataValueAsString)
+      }
     }
 

Review Comment:
   Can you explain what happens if validation is 'limited' or if validation is 'on'. Will that still work? I think limited would still work, but "on" should fail as this XML output will not have a string where one is expected. 



##########
daffodil-sapi/src/test/resources/test/sapi/mySchemaStringAsXml.dfdl.xsd:
##########
@@ -0,0 +1,38 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+<schema xmlns="http://www.w3.org/2001/XMLSchema"
+  targetNamespace="http://example.com"
+  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
+  xmlns:dfdlx="http://www.ogf.org/dfdl/dfdl-1.0/extensions"
+  xmlns:xs="http://www.w3.org/2001/XMLSchema"
+  xmlns:tns="http://example.com">
+
+  <include schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd"/>
+    
+  <annotation>
+    <appinfo source="http://www.ogf.org/dfdl/">
+      <dfdl:format ref="tns:GeneralFormat" />
+    </appinfo>
+  </annotation>
+  
+  <element name="file" type="xs:string"

Review Comment:
   This test shows the feature working in a minimalist way, but it is not useful as an example of what this feature is really intended to enable. 
   
   I can approve this PR as is, and we can make creating the more illustrative test a separate ticket, or we can just create it now. 
   
   I would like the example to start with  src/test/resources/xmlPayloadSchemaLocation/xmlPayload.xsd, the schema for the payload XML. This schema must be able to be reused without modification. This needs to have multiple attributes declared as well as elements.
   
   Then there is the DFDL schema binMessage.dfdl.xsd, which describes a file containing an **array** of binary data records each with four child elements: someNum, strLen, xmlStr, and anotherNum. strLen should be the length of the string. xmlStr should have an explicit simple type that extends xs:string and has a pattern facet that will fail if validation is limited (since limited validation happens during parsing, so before any of this XML trickery takes place). 
   
   binMessageWithXML.dat should be the binary file.  The XML string inside it should contain all the examples of things that will NOT be preserved by our particular form of canonicalization, and those things that are unexpected (by some people) to be preserved. 
   
   Things we will preserve (as I understand it)
   - Character Entities
   - CDATA regions
   - Comments
   - Processing Instructions (not just the first line, but embedded within the XML)
   - DTD
   
   Things we don't preserve (as I understand it)
   - single vs. double quotes
   - attribute values with leading/trailing/multiple-adjacent spaces
   - maybe order of attributes (not clear if that will change or be preserved)
   - CRLF
   - CR (isolated)
   - choice of which namespace prefix is used (Test data needs to have multiple prefix defs that are for the same namespace, so data goes in with one prefix, comes out with another.)
   
   Default namespace definitions should be used in binMessage.dfdl.xsd, and in xmlPayload.xsd so as to show this is handled properly. Some namespace prefix definitions should also be repeated (for different namespaces) in these two schemas so as to show there is no conflict. 
   
   This XML should contain something that violates a pattern facet in xmlPayload.xsd, like a number out of range for min/maxInclusive facets. 
   
   Then there is binMessageWithXMLPayload.xml which is the expected combined XML output, for comparison with actual. This is created by hand to match what the XMLTextInfosetOutputter produces for the combined XML. 
   
   Then there is binMessageWithXMLPayload.xsd which is the combined XSD that can be used to validate binMessageWithXMLPayload.xml. This schema should xs:import xmlPayload.xsd, and reference its root element as a child of the xmlStr element, which is not an xs:string, but a complex type having one child, an element ref to the root of xmlPayload.xsd. Comments should explain how xmlStr's definition has been replaced, blah blah. 
   
   It is important to illustrate (and point out in comments) that xmlPayload.xsd is reused like this  by binMessageWithXMLPayload.xsd without change. 
   
   The scala code should illustrate 
   
   - the DFDL parse (with 'limited' validation, which should output that the pattern isn't matched as a validation error, as a way of showing limited validation works). 
   - Then the XML result of the parse should be validated with Xerces or other XSD validation tool against the combined binMessageWithXMLPayload.xsd, which should work and report a validation error (number out of range).
   - Then the unparse should create output1 but the output1 should have a canonicalized XML string, and so not matching the input exactly because the xmlStr isn't the same. Comparison should show this. 
   - Then we parse this output1 data again. The infoset2 XML should again match the expected.
   - Then we unparse infoset2  and the binary output2 shoud match output1 byte for byte. This illustrates that the canonicalization preserved everything that can be preserved about the xmlStr element. 
   
   A second test should use a binary data file containing 2 records, not just one. The XML can be canonical in those, and  we don't need it to cause the validation errors. It's just showing that the inputter/outputter are managing state right, and can handle data that has multiple of these stringAsXML strings in it. 



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -48,6 +51,72 @@ object XMLTextInfosetInputter {
     fact.setProperty(XMLInputFactory.IS_VALIDATING, false) // no DTD validation
     fact
   }
+
+  lazy val xmlOutputFactory = {
+    val fact = new com.ctc.wstx.stax.WstxOutputFactory()
+    fact
+  }
+
+  /**
+   * When the stringAsXml runtime property is set to true we treat the simple
+   * content as if it were XML and output it to the infoset. We use this value
+   * as the runtime property name. Additionally, when we output the XML we need
+   * a wrapper element to reset the default XML namespace in case the infoset
+   * has a default namespace. We also use this variable as the name of that
+   * wrapper element.
+   */
+  lazy val stringAsXml = "stringAsXml"
+
+  /**
+   * Read an event from an XMLStreamReader and write the same event to an
+   * XMLStreamWriter. For example, this can be used for convert XML from one
+   * type of input to a different type (e.g. InputStream to String), check for
+   * well-formed XML, and perform some basic normalization. Example usage of
+   * this function is to call this in a loop like this:
+   *
+   *   while (xsr.hasNext()) {
+   *     XMLTextInfoset.writeStreamEvent(xsr, xsw)
+   *     xsr.next()
+   *   }
+   *
+   * This approach allows callers to skip certain events by conditionally
+   * calling this function.
+   */
+  def writeXMLStreamEvent(xsr: XMLStreamReader, xsw: XMLStreamWriter): Unit = {
+    xsr.getEventType() match {
+      case START_ELEMENT => {
+        xsw.writeStartElement(
+          xsr.getPrefix(),
+          xsr.getLocalName(),
+          xsr.getNamespaceURI())
+        for (i <- 0 until xsr.getNamespaceCount()) {
+          xsw.writeNamespace(
+            xsr.getNamespacePrefix(i),
+            xsr.getNamespaceURI(i))
+        }
+        for (i <- 0 until xsr.getAttributeCount()) {
+          xsw.writeAttribute(
+            xsr.getAttributePrefix(i),
+            xsr.getAttributeNamespace(i),
+            xsr.getAttributeLocalName(i),
+            xsr.getAttributeValue(i))
+        }
+      }
+      case END_ELEMENT => xsw.writeEndElement()
+      case CHARACTERS => xsw.writeCharacters(xsr.getText())
+      case COMMENT => xsw.writeComment(xsr.getText())
+      case CDATA => xsw.writeCData(xsr.getText())
+      case PROCESSING_INSTRUCTION => xsw.writeProcessingInstruction(xsr.getPITarget(), xsr.getPIData())
+      case ENTITY_REFERENCE => xsw.writeEntityRef(xsr.getLocalName())

Review Comment:
   If I am understanding this properly, a comment here about coalescing being false so entity refs are preserved is important. 



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -48,6 +51,72 @@ object XMLTextInfosetInputter {
     fact.setProperty(XMLInputFactory.IS_VALIDATING, false) // no DTD validation
     fact
   }
+
+  lazy val xmlOutputFactory = {
+    val fact = new com.ctc.wstx.stax.WstxOutputFactory()
+    fact
+  }
+
+  /**
+   * When the stringAsXml runtime property is set to true we treat the simple
+   * content as if it were XML and output it to the infoset. We use this value
+   * as the runtime property name. Additionally, when we output the XML we need
+   * a wrapper element to reset the default XML namespace in case the infoset
+   * has a default namespace. We also use this variable as the name of that
+   * wrapper element.
+   */
+  lazy val stringAsXml = "stringAsXml"
+
+  /**
+   * Read an event from an XMLStreamReader and write the same event to an
+   * XMLStreamWriter. For example, this can be used for convert XML from one
+   * type of input to a different type (e.g. InputStream to String), check for
+   * well-formed XML, and perform some basic normalization. Example usage of
+   * this function is to call this in a loop like this:
+   *
+   *   while (xsr.hasNext()) {
+   *     XMLTextInfoset.writeStreamEvent(xsr, xsw)
+   *     xsr.next()
+   *   }
+   *
+   * This approach allows callers to skip certain events by conditionally
+   * calling this function.
+   */
+  def writeXMLStreamEvent(xsr: XMLStreamReader, xsw: XMLStreamWriter): Unit = {
+    xsr.getEventType() match {
+      case START_ELEMENT => {
+        xsw.writeStartElement(
+          xsr.getPrefix(),
+          xsr.getLocalName(),
+          xsr.getNamespaceURI())
+        for (i <- 0 until xsr.getNamespaceCount()) {
+          xsw.writeNamespace(
+            xsr.getNamespacePrefix(i),
+            xsr.getNamespaceURI(i))
+        }
+        for (i <- 0 until xsr.getAttributeCount()) {
+          xsw.writeAttribute(
+            xsr.getAttributePrefix(i),
+            xsr.getAttributeNamespace(i),
+            xsr.getAttributeLocalName(i),
+            xsr.getAttributeValue(i))
+        }
+      }
+      case END_ELEMENT => xsw.writeEndElement()
+      case CHARACTERS => xsw.writeCharacters(xsr.getText())
+      case COMMENT => xsw.writeComment(xsr.getText())
+      case CDATA => xsw.writeCData(xsr.getText())
+      case PROCESSING_INSTRUCTION => xsw.writeProcessingInstruction(xsr.getPITarget(), xsr.getPIData())
+      case ENTITY_REFERENCE => xsw.writeEntityRef(xsr.getLocalName())
+      case DTD => xsw.writeDTD(xsr.getText())

Review Comment:
   Definitely need a comment (and documentation needs to mention this also) that DTDs are allowed and preserved in the embedded payload XML strings. 



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -48,6 +51,72 @@ object XMLTextInfosetInputter {
     fact.setProperty(XMLInputFactory.IS_VALIDATING, false) // no DTD validation
     fact
   }
+
+  lazy val xmlOutputFactory = {
+    val fact = new com.ctc.wstx.stax.WstxOutputFactory()
+    fact
+  }
+
+  /**
+   * When the stringAsXml runtime property is set to true we treat the simple
+   * content as if it were XML and output it to the infoset. We use this value
+   * as the runtime property name. Additionally, when we output the XML we need
+   * a wrapper element to reset the default XML namespace in case the infoset
+   * has a default namespace. We also use this variable as the name of that
+   * wrapper element.
+   */
+  lazy val stringAsXml = "stringAsXml"
+
+  /**
+   * Read an event from an XMLStreamReader and write the same event to an
+   * XMLStreamWriter. For example, this can be used for convert XML from one
+   * type of input to a different type (e.g. InputStream to String), check for
+   * well-formed XML, and perform some basic normalization. Example usage of
+   * this function is to call this in a loop like this:
+   *
+   *   while (xsr.hasNext()) {
+   *     XMLTextInfoset.writeStreamEvent(xsr, xsw)
+   *     xsr.next()
+   *   }
+   *
+   * This approach allows callers to skip certain events by conditionally
+   * calling this function.
+   */
+  def writeXMLStreamEvent(xsr: XMLStreamReader, xsw: XMLStreamWriter): Unit = {
+    xsr.getEventType() match {
+      case START_ELEMENT => {
+        xsw.writeStartElement(
+          xsr.getPrefix(),
+          xsr.getLocalName(),
+          xsr.getNamespaceURI())
+        for (i <- 0 until xsr.getNamespaceCount()) {
+          xsw.writeNamespace(
+            xsr.getNamespacePrefix(i),
+            xsr.getNamespaceURI(i))
+        }
+        for (i <- 0 until xsr.getAttributeCount()) {
+          xsw.writeAttribute(
+            xsr.getAttributePrefix(i),
+            xsr.getAttributeNamespace(i),
+            xsr.getAttributeLocalName(i),
+            xsr.getAttributeValue(i))
+        }
+      }
+      case END_ELEMENT => xsw.writeEndElement()
+      case CHARACTERS => xsw.writeCharacters(xsr.getText())
+      case COMMENT => xsw.writeComment(xsr.getText())

Review Comment:
   So we are able to preserve comments and processing instructions. That's pretty interesting. I wonder if whitespace within those will be preserved or collapsed/modified. Our test data embedded XML string should test this to see. More on this in later comment about test enhancement. 
   
   I see below even entity references seem like they're being preserved. 
   
   So is this why the coalescing 'false' setting is used?
   
   Of interesting note, if we created a corresponding feature for the EXI inputter/outputter, it would NOT be able to preserve many of these things, as EXI itself doesn't preserve many of them. 



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetOutputter.scala:
##########
@@ -99,6 +102,56 @@ class XMLTextInfosetOutputter private (writer: java.io.Writer, pretty: Boolean,
     writer.write(">")
   }
 
+  private def writeStringAsXml(str: String): Unit = {
+    // create a wrapper element that allows us to reset the default XML
+    // namespace. This ensures the embedded XML does not inherit the default
+    // namespaces if one is defined in the infoset
+    incrementIndentation()
+    if (pretty) {
+      writer.write(System.lineSeparator())
+      outputIndentation(writer)
+    }
+    writer.write("<")
+    writer.write(XMLTextInfoset.stringAsXml)
+    writer.write(" xmlns=\"\">")
+

Review Comment:
   Your intro text also showed an ```<xml>.... </xml>``` added enclosing element in the mix. I don't see that in this code. 



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -17,24 +17,27 @@
 
 package org.apache.daffodil.infoset
 
+import java.io.StringWriter
+import java.nio.charset.StandardCharsets
+import javax.xml.XMLConstants
+import javax.xml.stream.XMLInputFactory
+import javax.xml.stream.XMLStreamConstants._
+import javax.xml.stream.XMLStreamException
+import javax.xml.stream.XMLStreamReader
+import javax.xml.stream.XMLStreamWriter
+import javax.xml.stream.util.XMLEventAllocator
+
+import org.apache.daffodil.dpath.NodeInfo
 import org.apache.daffodil.exceptions.Assert
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
 import org.apache.daffodil.util.Misc
 import org.apache.daffodil.xml.XMLUtils
-import org.apache.daffodil.util.MaybeBoolean
-import org.apache.daffodil.dpath.NodeInfo
-import org.apache.daffodil.infoset.InfosetInputterEventType._
-
-import javax.xml.stream.XMLStreamReader
-import javax.xml.stream.XMLStreamConstants._
-import javax.xml.stream.XMLInputFactory
-import javax.xml.stream.util.XMLEventAllocator
-import javax.xml.stream.XMLStreamException
-import javax.xml.XMLConstants
 
-object XMLTextInfosetInputter {
+object XMLTextInfoset {
   lazy val xmlInputFactory = {
     val fact = new com.ctc.wstx.stax.WstxInputFactory()
-    fact.setProperty(XMLInputFactory.IS_COALESCING, true)
+    fact.setProperty(XMLInputFactory.IS_COALESCING, false)

Review Comment:
   Need comment explaining why coalescing is false here. I think that's about preserving entity references, but I'm not certain. 



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline

Review Comment:
   Much too big to inline this. Why inline something with Try and case matching ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938201581


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetOutputter.scala:
##########
@@ -107,15 +160,16 @@ class XMLTextInfosetOutputter private (writer: java.io.Writer, pretty: Boolean,
     outputStartTag(simple)
 
     if (!isNilled(simple) && simple.hasValue) {
-      val text =
-        if (simple.erd.optPrimType.get.isInstanceOf[NodeInfo.String.Kind]) {
-          val s = remapped(simple.dataValueAsString)
-          scala.xml.Utility.escape(s)
+      if (simple.erd.optPrimType.get == NodeInfo.String) {
+        val simpleVal = simple.dataValueAsString
+        if (simple.erd.runtimeProperties.get(XMLTextInfoset.stringAsXml) == "true") {
+          writeStringAsXml(simpleVal)
         } else {
-          simple.dataValueAsString
+          writer.write(scala.xml.Utility.escape(remapped(simpleVal)))
         }
-
-      writer.write(text)
+      } else {
+        writer.write(simple.dataValueAsString)
+      }
     }
 

Review Comment:
   `validation="limited"` should work just like normal, as if stringAsXml didn't exist. This is because the "limited" validation we do uses the values on our internal infoset representation.
   
   `validation="on"` will fail, no matter what infoset outputter someone uses. This is because with this mode, internally we `tee` the infoset events to both the InfosetOutputter supplied by the user and the `XMLTextInfosetOutputter`. And since the XMLTextInfosetOutputter results won't match the schema it will always fail.
   
   We could maybe have an option so that the XMLTextInfosetOutputter will disable stringAsXml, so the user gets a stringAsXml-ified infoset, and our Xerces gets a normal infoset with the string escaped. This would allow validation=on to work, but with the caveat that it's only going to validate the stringAsXml stuff as a normal string and not as XML. Should we add that feature? It would just be a boolean flag in the XMLTextInfosetOutputter, but normal users should probably never set it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on PR #819:
URL: https://github.com/apache/daffodil/pull/819#issuecomment-1205737835

   > We also need a negative test of non-well-formed XML showing we get an error.
   
   In the Scala API tests I added we do have two negative tests, one for parsing where the parsed data is not valid XML, and one for parsing where the infoset is not valid XML.
   
   > Also, I think non-well-formed XML should be a parse error, not a runtime SDE.
   
   This actually raises a good point. The XMLTextInfosetOutputter is the thing that converts the string to XML. But that conversion doesn't necessarily happen when the string was originally parsed. InfosetOutputeer events are only triggered after all previous points of uncertainty are resolved. So it's very possible that we're long passed where the string was parsed, and so throwing a ParseError would cause backtracking at some random point. So I think any error thrown by an InfosetOutputter has to be considered fatal and stop the parse immediately. It's impossible to correctly recover or backtrack. Maybe an SDE isn't the right error and we need something new, but I don't think it can be a ParseError. Maybe we have a new "InfosetOutputter" error or something unique to the situation?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r943821771


##########
daffodil-test/src/test/scala/org/apache/daffodil/infoset/TestStringAsXml.scala:
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.infoset
+
+import java.io.ByteArrayOutputStream
+import java.io.ByteArrayInputStream
+import java.io.File
+import java.io.InputStream
+import java.net.URI
+import java.nio.charset.StandardCharsets
+import javax.xml.XMLConstants
+import javax.xml.transform.stream.StreamSource
+import javax.xml.validation.SchemaFactory
+
+import org.junit.Test
+import org.junit.Assert._
+
+import org.xml.sax.SAXParseException
+
+import org.apache.commons.io.IOUtils
+
+import org.apache.daffodil.api.DFDL.DataProcessor
+import org.apache.daffodil.api.URISchemaSource
+import org.apache.daffodil.api.ValidationMode
+import org.apache.daffodil.compiler.Compiler
+import org.apache.daffodil.io.InputSourceDataInputStream
+import org.apache.daffodil.util.Misc
+import org.apache.daffodil.Implicits.intercept
+
+
+class TestStringAsXml {
+
+  private def compileSchema(dfdlSchemaURI: URI) = {
+    val c = Compiler()
+    val pf = c.compileSource(URISchemaSource(dfdlSchemaURI))
+    val dp = pf.onPath("/")
+    dp.withValidationMode(ValidationMode.Full)
+  }
+
+  private def doParse(dp: DataProcessor, data: InputStream) = {
+    val parseIn = InputSourceDataInputStream(data)
+    val parseBos = new ByteArrayOutputStream()
+    val parseOut = new XMLTextInfosetOutputter(parseBos, pretty = true)
+    val parseRes = dp.parse(parseIn, parseOut)
+    val parseDiags = parseRes.getDiagnostics.map(_.toString)
+    val parseStrOpt = if (parseRes.isProcessingError) None else Some(parseBos.toString)
+    (parseDiags, parseStrOpt)
+  }
+
+  private def doUnparse(dp: DataProcessor, infoset: InputStream) = {
+    val unparseIn = new XMLTextInfosetInputter(infoset)
+    val unparseBos = new ByteArrayOutputStream()
+    val unparseOut = java.nio.channels.Channels.newChannel(unparseBos)
+    val unparseRes = dp.unparse(unparseIn, unparseOut)
+    val unparseDiags = unparseRes.getDiagnostics.map(_.toString)
+    val unparseStrOpt = if (unparseRes.isProcessingError) None else Some(unparseBos.toString)
+    (unparseDiags, unparseStrOpt)
+  }
+
+  @Test def test_stringAsXml_01(): Unit = {
+    val dp = compileSchema(Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/xsd/binMessage.dfdl.xsd"))
+    val parseData = Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/binMessage_01.dat").toURL.openStream
+    val (parseDiags, Some(parseInfosetActual)) = doParse(dp, parseData)
+    val parseInfosetExpected = {
+      val is = Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/binMessage_01.dat.xml").toURL.openStream
+      IOUtils.toString(is, StandardCharsets.UTF_8)
+    }
+    // diagnostic from full validation
+    assertTrue(parseDiags.find(_.contains("Element 'xmlStr' is a simple type")).isDefined)
+    // diagnostic from limited validation
+    assertTrue(parseDiags.find(_.contains("xmlStr failed facet checks due to: facet maxLength")).isDefined)
+    // we still get the expected infoset, replace CRLF with LF because of git windows autocrlf
+    assertEquals(parseInfosetExpected.replace("\r\n", "\n"), parseInfosetActual.replace("\r\n", "\n"))
+
+    // validate the infoset using the handwritten WithPayload schema
+    val xsdFile = new File(Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/xsd/binMessageWithXmlPayload.xsd"))
+    val xsdFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
+    val xsd = xsdFactory.newSchema(xsdFile)
+    val source = new StreamSource(new ByteArrayInputStream(parseInfosetActual.getBytes(StandardCharsets.UTF_8)))
+    val validator = xsd.newValidator()
+    val e = intercept[SAXParseException] {
+      validator.validate(source)
+    }
+    assertTrue(e.toString.contains("Value '=invalid field' is not facet-valid"))
+  }
+
+  @Test def test_stringAsXml_02(): Unit = {
+    val dp = compileSchema(Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/xsd/binMessage.dfdl.xsd"))
+    val unparseInfoset = Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/binMessage_01.dat.xml").toURL.openStream
+    val (_, Some(unparseDataActual)) = doUnparse(dp, unparseInfoset)
+    val unparseDataExpected = {
+      val is = Misc.getRequiredResource("/org/apache/daffodil/infoset/stringAsXml/namespaced/binMessage_01.dat.xml.dat").toURL.openStream

Review Comment:
   Why the funny file name .dat.xml.dat? Are you nesting these things? If so deserves a comment. If not, ordinary naming would be better. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938205888


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -48,6 +51,72 @@ object XMLTextInfosetInputter {
     fact.setProperty(XMLInputFactory.IS_VALIDATING, false) // no DTD validation
     fact
   }
+
+  lazy val xmlOutputFactory = {
+    val fact = new com.ctc.wstx.stax.WstxOutputFactory()
+    fact
+  }
+
+  /**
+   * When the stringAsXml runtime property is set to true we treat the simple
+   * content as if it were XML and output it to the infoset. We use this value
+   * as the runtime property name. Additionally, when we output the XML we need
+   * a wrapper element to reset the default XML namespace in case the infoset
+   * has a default namespace. We also use this variable as the name of that
+   * wrapper element.
+   */
+  lazy val stringAsXml = "stringAsXml"
+
+  /**
+   * Read an event from an XMLStreamReader and write the same event to an
+   * XMLStreamWriter. For example, this can be used for convert XML from one
+   * type of input to a different type (e.g. InputStream to String), check for
+   * well-formed XML, and perform some basic normalization. Example usage of
+   * this function is to call this in a loop like this:
+   *
+   *   while (xsr.hasNext()) {
+   *     XMLTextInfoset.writeStreamEvent(xsr, xsw)
+   *     xsr.next()
+   *   }
+   *
+   * This approach allows callers to skip certain events by conditionally
+   * calling this function.
+   */
+  def writeXMLStreamEvent(xsr: XMLStreamReader, xsw: XMLStreamWriter): Unit = {
+    xsr.getEventType() match {
+      case START_ELEMENT => {
+        xsw.writeStartElement(
+          xsr.getPrefix(),
+          xsr.getLocalName(),
+          xsr.getNamespaceURI())
+        for (i <- 0 until xsr.getNamespaceCount()) {
+          xsw.writeNamespace(
+            xsr.getNamespacePrefix(i),
+            xsr.getNamespaceURI(i))
+        }
+        for (i <- 0 until xsr.getAttributeCount()) {
+          xsw.writeAttribute(
+            xsr.getAttributePrefix(i),
+            xsr.getAttributeNamespace(i),
+            xsr.getAttributeLocalName(i),
+            xsr.getAttributeValue(i))
+        }
+      }
+      case END_ELEMENT => xsw.writeEndElement()
+      case CHARACTERS => xsw.writeCharacters(xsr.getText())
+      case COMMENT => xsw.writeComment(xsr.getText())
+      case CDATA => xsw.writeCData(xsr.getText())
+      case PROCESSING_INSTRUCTION => xsw.writeProcessingInstruction(xsr.getPITarget(), xsr.getPIData())
+      case ENTITY_REFERENCE => xsw.writeEntityRef(xsr.getLocalName())

Review Comment:
   Yep, will do.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938375725


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline

Review Comment:
   I did look at the bytecode for this once, and inline did not (at that time) fold this proerly to avoid the by-name thunk. 
   
   Supposedly newer scala (3.0) fixes this. So someday....



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938377726


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -48,6 +51,72 @@ object XMLTextInfosetInputter {
     fact.setProperty(XMLInputFactory.IS_VALIDATING, false) // no DTD validation
     fact
   }
+
+  lazy val xmlOutputFactory = {
+    val fact = new com.ctc.wstx.stax.WstxOutputFactory()
+    fact
+  }
+
+  /**
+   * When the stringAsXml runtime property is set to true we treat the simple
+   * content as if it were XML and output it to the infoset. We use this value
+   * as the runtime property name. Additionally, when we output the XML we need
+   * a wrapper element to reset the default XML namespace in case the infoset
+   * has a default namespace. We also use this variable as the name of that
+   * wrapper element.
+   */
+  lazy val stringAsXml = "stringAsXml"
+
+  /**
+   * Read an event from an XMLStreamReader and write the same event to an
+   * XMLStreamWriter. For example, this can be used for convert XML from one
+   * type of input to a different type (e.g. InputStream to String), check for
+   * well-formed XML, and perform some basic normalization. Example usage of
+   * this function is to call this in a loop like this:
+   *
+   *   while (xsr.hasNext()) {
+   *     XMLTextInfoset.writeStreamEvent(xsr, xsw)
+   *     xsr.next()
+   *   }
+   *
+   * This approach allows callers to skip certain events by conditionally
+   * calling this function.
+   */
+  def writeXMLStreamEvent(xsr: XMLStreamReader, xsw: XMLStreamWriter): Unit = {
+    xsr.getEventType() match {
+      case START_ELEMENT => {
+        xsw.writeStartElement(
+          xsr.getPrefix(),
+          xsr.getLocalName(),
+          xsr.getNamespaceURI())
+        for (i <- 0 until xsr.getNamespaceCount()) {
+          xsw.writeNamespace(
+            xsr.getNamespacePrefix(i),
+            xsr.getNamespaceURI(i))
+        }
+        for (i <- 0 until xsr.getAttributeCount()) {
+          xsw.writeAttribute(
+            xsr.getAttributePrefix(i),
+            xsr.getAttributeNamespace(i),
+            xsr.getAttributeLocalName(i),
+            xsr.getAttributeValue(i))
+        }
+      }
+      case END_ELEMENT => xsw.writeEndElement()
+      case CHARACTERS => xsw.writeCharacters(xsr.getText())
+      case COMMENT => xsw.writeComment(xsr.getText())
+      case CDATA => xsw.writeCData(xsr.getText())
+      case PROCESSING_INSTRUCTION => xsw.writeProcessingInstruction(xsr.getPITarget(), xsr.getPIData())
+      case ENTITY_REFERENCE => xsw.writeEntityRef(xsr.getLocalName())
+      case DTD => xsw.writeDTD(xsr.getText())

Review Comment:
   Hmmm. giving this a bit more thought. A DTD has to be at the start of the XML right? So it doesn't make sense I think now, to preserve a DTD in an XML thunk that is going to be inside another XML document as a child element somewhere. 
   
   I think we just strip out DTDs, and document that we do. 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938381835


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -17,24 +17,27 @@
 
 package org.apache.daffodil.infoset
 
+import java.io.StringWriter
+import java.nio.charset.StandardCharsets
+import javax.xml.XMLConstants
+import javax.xml.stream.XMLInputFactory
+import javax.xml.stream.XMLStreamConstants._
+import javax.xml.stream.XMLStreamException
+import javax.xml.stream.XMLStreamReader
+import javax.xml.stream.XMLStreamWriter
+import javax.xml.stream.util.XMLEventAllocator
+
+import org.apache.daffodil.dpath.NodeInfo
 import org.apache.daffodil.exceptions.Assert
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
 import org.apache.daffodil.util.Misc
 import org.apache.daffodil.xml.XMLUtils
-import org.apache.daffodil.util.MaybeBoolean
-import org.apache.daffodil.dpath.NodeInfo
-import org.apache.daffodil.infoset.InfosetInputterEventType._
-
-import javax.xml.stream.XMLStreamReader
-import javax.xml.stream.XMLStreamConstants._
-import javax.xml.stream.XMLInputFactory
-import javax.xml.stream.util.XMLEventAllocator
-import javax.xml.stream.XMLStreamException
-import javax.xml.XMLConstants
 
-object XMLTextInfosetInputter {
+object XMLTextInfoset {
   lazy val xmlInputFactory = {
     val fact = new com.ctc.wstx.stax.WstxInputFactory()
-    fact.setProperty(XMLInputFactory.IS_COALESCING, true)
+    fact.setProperty(XMLInputFactory.IS_COALESCING, false)

Review Comment:
   I recall now. There is/was a command line option -Xxml:-coalescing which we have in our build.sbt. Doc for that is what I had read long time ago. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938924948


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline
+  private def doOutputter(outputterFunc: => Boolean, desc: String, context: ThrowsSDE): Unit = {
+    Try(outputterFunc) match {
+      case Success(true) => // success
+      case Success(false) => context.SDE("Failed to %s", desc)
+      case Failure(e) => context.SDE("Failed to %s: %s", desc, e.toString)

Review Comment:
   That might make more sense, that makes it more clear that the problem is in the infoset inputter which is user provided, and not necessarily a problem with Daffodil. Though, most people are going to use the ones we ship with Daffodil, so it might be unexpected for users to have to catch an exception instead of just get a normal SDE. I don't feel strongly either way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r941432688


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline
+  private def doOutputter(outputterFunc: => Boolean, desc: String, context: ThrowsSDE): Unit = {
+    Try(outputterFunc) match {
+      case Success(true) => // success
+      case Success(false) => context.SDE("Failed to %s", desc)
+      case Failure(e) => context.SDE("Failed to %s: %s", desc, e.toString)

Review Comment:
   I think throwing the exception might cause problems--I don't think parse ever is supposed to throw, we just always convert exceptions to some form of diagnostic or parse/unaprse error. But I can update the SDE to include the reason.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r941436139


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -452,7 +466,7 @@ class InfosetWalker private (
       if (child.isSimple) {
         if (!child.isHidden || walkHidden) {
           val simple = child.asInstanceOf[DISimple]
-          outputter.startSimple(simple)
+          doOutputter(outputter.startSimple(simple), "start infoset simple element", simple.erd)
           outputter.endSimple(simple)

Review Comment:
   Yep it does. Good catch.
   
   Propogating exceptions isn't something we normally do, and I'm afraid it could cause side effects. For example, if an individual parser catches `Exception` (which they shouldn't do but I think happens in a few places)` then this might get silently ignored, or converted to some unrelated failure diagnostic.
   
   But we are good about parsers not catching SDE exceptions, so I think that's safer to ensure we bubbleup to top and stop the parse.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938773519


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -48,6 +51,72 @@ object XMLTextInfosetInputter {
     fact.setProperty(XMLInputFactory.IS_VALIDATING, false) // no DTD validation
     fact
   }
+
+  lazy val xmlOutputFactory = {
+    val fact = new com.ctc.wstx.stax.WstxOutputFactory()
+    fact
+  }
+
+  /**
+   * When the stringAsXml runtime property is set to true we treat the simple
+   * content as if it were XML and output it to the infoset. We use this value
+   * as the runtime property name. Additionally, when we output the XML we need
+   * a wrapper element to reset the default XML namespace in case the infoset
+   * has a default namespace. We also use this variable as the name of that
+   * wrapper element.
+   */
+  lazy val stringAsXml = "stringAsXml"
+
+  /**
+   * Read an event from an XMLStreamReader and write the same event to an
+   * XMLStreamWriter. For example, this can be used for convert XML from one
+   * type of input to a different type (e.g. InputStream to String), check for
+   * well-formed XML, and perform some basic normalization. Example usage of
+   * this function is to call this in a loop like this:
+   *
+   *   while (xsr.hasNext()) {
+   *     XMLTextInfoset.writeStreamEvent(xsr, xsw)
+   *     xsr.next()
+   *   }
+   *
+   * This approach allows callers to skip certain events by conditionally
+   * calling this function.
+   */
+  def writeXMLStreamEvent(xsr: XMLStreamReader, xsw: XMLStreamWriter): Unit = {
+    xsr.getEventType() match {
+      case START_ELEMENT => {
+        xsw.writeStartElement(
+          xsr.getPrefix(),
+          xsr.getLocalName(),
+          xsr.getNamespaceURI())
+        for (i <- 0 until xsr.getNamespaceCount()) {
+          xsw.writeNamespace(
+            xsr.getNamespacePrefix(i),
+            xsr.getNamespaceURI(i))
+        }
+        for (i <- 0 until xsr.getAttributeCount()) {
+          xsw.writeAttribute(
+            xsr.getAttributePrefix(i),
+            xsr.getAttributeNamespace(i),
+            xsr.getAttributeLocalName(i),
+            xsr.getAttributeValue(i))
+        }
+      }
+      case END_ELEMENT => xsw.writeEndElement()
+      case CHARACTERS => xsw.writeCharacters(xsr.getText())
+      case COMMENT => xsw.writeComment(xsr.getText())
+      case CDATA => xsw.writeCData(xsr.getText())
+      case PROCESSING_INSTRUCTION => xsw.writeProcessingInstruction(xsr.getPITarget(), xsr.getPIData())
+      case ENTITY_REFERENCE => xsw.writeEntityRef(xsr.getLocalName())
+      case DTD => xsw.writeDTD(xsr.getText())

Review Comment:
   Yep, makes sense, I'll make that change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938209080


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -48,6 +51,72 @@ object XMLTextInfosetInputter {
     fact.setProperty(XMLInputFactory.IS_VALIDATING, false) // no DTD validation
     fact
   }
+
+  lazy val xmlOutputFactory = {
+    val fact = new com.ctc.wstx.stax.WstxOutputFactory()
+    fact
+  }
+
+  /**
+   * When the stringAsXml runtime property is set to true we treat the simple
+   * content as if it were XML and output it to the infoset. We use this value
+   * as the runtime property name. Additionally, when we output the XML we need
+   * a wrapper element to reset the default XML namespace in case the infoset
+   * has a default namespace. We also use this variable as the name of that
+   * wrapper element.
+   */
+  lazy val stringAsXml = "stringAsXml"
+
+  /**
+   * Read an event from an XMLStreamReader and write the same event to an
+   * XMLStreamWriter. For example, this can be used for convert XML from one
+   * type of input to a different type (e.g. InputStream to String), check for
+   * well-formed XML, and perform some basic normalization. Example usage of
+   * this function is to call this in a loop like this:
+   *
+   *   while (xsr.hasNext()) {
+   *     XMLTextInfoset.writeStreamEvent(xsr, xsw)
+   *     xsr.next()
+   *   }
+   *
+   * This approach allows callers to skip certain events by conditionally
+   * calling this function.
+   */
+  def writeXMLStreamEvent(xsr: XMLStreamReader, xsw: XMLStreamWriter): Unit = {
+    xsr.getEventType() match {
+      case START_ELEMENT => {
+        xsw.writeStartElement(
+          xsr.getPrefix(),
+          xsr.getLocalName(),
+          xsr.getNamespaceURI())
+        for (i <- 0 until xsr.getNamespaceCount()) {
+          xsw.writeNamespace(
+            xsr.getNamespacePrefix(i),
+            xsr.getNamespaceURI(i))
+        }
+        for (i <- 0 until xsr.getAttributeCount()) {
+          xsw.writeAttribute(
+            xsr.getAttributePrefix(i),
+            xsr.getAttributeNamespace(i),
+            xsr.getAttributeLocalName(i),
+            xsr.getAttributeValue(i))
+        }
+      }
+      case END_ELEMENT => xsw.writeEndElement()
+      case CHARACTERS => xsw.writeCharacters(xsr.getText())
+      case COMMENT => xsw.writeComment(xsr.getText())
+      case CDATA => xsw.writeCData(xsr.getText())
+      case PROCESSING_INSTRUCTION => xsw.writeProcessingInstruction(xsr.getPITarget(), xsr.getPIData())
+      case ENTITY_REFERENCE => xsw.writeEntityRef(xsr.getLocalName())
+      case DTD => xsw.writeDTD(xsr.getText())

Review Comment:
   So that's maybe worth considering if we want to support that. I assumed so, but the logic in the infosetter and outputter make it pretty straightforward if there's anything that could cause issues.
   
   For example, if we allow DTD's that define new entities, is it possible that could cause issues in other parts of the infoset if they then use those entities? Before this change, one could very safely use our infosets because they were very limited (only complex types, and simple types, and everything escape correctly). But with this change, infosets are much less safe, one could maybe insert a XML Bomb in the infoset.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on PR #819:
URL: https://github.com/apache/daffodil/pull/819#issuecomment-1205724268

   > Question: can I use stringAsXML=true on an element with dfdl:inputValueCalc?
   
   This should work just fine. The XMLInfosetInputter/Outputter doesn't have any knowledge about whether simple type came from parsed data or from calcuclated data via inputValueCalc. It's all just a string to it. What you suggest is a reasonable approach if allow validation of that XML but also preserve byte-for-byte on unparse.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r938205660


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -48,6 +51,72 @@ object XMLTextInfosetInputter {
     fact.setProperty(XMLInputFactory.IS_VALIDATING, false) // no DTD validation
     fact
   }
+
+  lazy val xmlOutputFactory = {
+    val fact = new com.ctc.wstx.stax.WstxOutputFactory()
+    fact
+  }
+
+  /**
+   * When the stringAsXml runtime property is set to true we treat the simple
+   * content as if it were XML and output it to the infoset. We use this value
+   * as the runtime property name. Additionally, when we output the XML we need
+   * a wrapper element to reset the default XML namespace in case the infoset
+   * has a default namespace. We also use this variable as the name of that
+   * wrapper element.
+   */
+  lazy val stringAsXml = "stringAsXml"
+
+  /**
+   * Read an event from an XMLStreamReader and write the same event to an
+   * XMLStreamWriter. For example, this can be used for convert XML from one
+   * type of input to a different type (e.g. InputStream to String), check for
+   * well-formed XML, and perform some basic normalization. Example usage of
+   * this function is to call this in a loop like this:
+   *
+   *   while (xsr.hasNext()) {
+   *     XMLTextInfoset.writeStreamEvent(xsr, xsw)
+   *     xsr.next()
+   *   }
+   *
+   * This approach allows callers to skip certain events by conditionally
+   * calling this function.
+   */
+  def writeXMLStreamEvent(xsr: XMLStreamReader, xsw: XMLStreamWriter): Unit = {
+    xsr.getEventType() match {
+      case START_ELEMENT => {
+        xsw.writeStartElement(
+          xsr.getPrefix(),
+          xsr.getLocalName(),
+          xsr.getNamespaceURI())
+        for (i <- 0 until xsr.getNamespaceCount()) {
+          xsw.writeNamespace(
+            xsr.getNamespacePrefix(i),
+            xsr.getNamespaceURI(i))
+        }
+        for (i <- 0 until xsr.getAttributeCount()) {
+          xsw.writeAttribute(
+            xsr.getAttributePrefix(i),
+            xsr.getAttributeNamespace(i),
+            xsr.getAttributeLocalName(i),
+            xsr.getAttributeValue(i))
+        }
+      }
+      case END_ELEMENT => xsw.writeEndElement()
+      case CHARACTERS => xsw.writeCharacters(xsr.getText())
+      case COMMENT => xsw.writeComment(xsr.getText())

Review Comment:
   Yep, I *think* much of this will be preserved. Though I didn't test it in much detail. This is really just a proxy function that converts XMLStreamReader events to XMLStreamWriter events. It's pretty straightforward to just handle all the events. I'm not sure exactly in what circumstances the woodstox XMLStreamReader will actually use those events. For example, with coalescing turned off, i noticed CDATA events were never created from the reader. But if it ever does, we can support them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r941436626


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/InfosetWalker.scala:
##########
@@ -408,14 +413,23 @@ class InfosetWalker private (
     containerIndexStack.setTop(top + 1)
   }
 
+  @inline
+  private def doOutputter(outputterFunc: => Boolean, desc: String, context: ThrowsSDE): Unit = {
+    Try(outputterFunc) match {
+      case Success(true) => // success
+      case Success(false) => context.SDE("Failed to %s", desc)

Review Comment:
   I'm ok with putting this off to a separate change set. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on code in PR #819:
URL: https://github.com/apache/daffodil/pull/819#discussion_r941438353


##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/infoset/XMLTextInfosetInputter.scala:
##########
@@ -17,24 +17,27 @@
 
 package org.apache.daffodil.infoset
 
+import java.io.StringWriter
+import java.nio.charset.StandardCharsets
+import javax.xml.XMLConstants
+import javax.xml.stream.XMLInputFactory
+import javax.xml.stream.XMLStreamConstants._
+import javax.xml.stream.XMLStreamException
+import javax.xml.stream.XMLStreamReader
+import javax.xml.stream.XMLStreamWriter
+import javax.xml.stream.util.XMLEventAllocator
+
+import org.apache.daffodil.dpath.NodeInfo
 import org.apache.daffodil.exceptions.Assert
+import org.apache.daffodil.infoset.InfosetInputterEventType._
+import org.apache.daffodil.util.MaybeBoolean
 import org.apache.daffodil.util.Misc
 import org.apache.daffodil.xml.XMLUtils
-import org.apache.daffodil.util.MaybeBoolean
-import org.apache.daffodil.dpath.NodeInfo
-import org.apache.daffodil.infoset.InfosetInputterEventType._
-
-import javax.xml.stream.XMLStreamReader
-import javax.xml.stream.XMLStreamConstants._
-import javax.xml.stream.XMLInputFactory
-import javax.xml.stream.util.XMLEventAllocator
-import javax.xml.stream.XMLStreamException
-import javax.xml.XMLConstants
 
-object XMLTextInfosetInputter {
+object XMLTextInfoset {

Review Comment:
   I changed the name because this object is now used in both XMLTextInfosetInputter and XMLTextInfosetOutputter, with the `xmlOutputFactory` and the `writeXmlStreamEvent` needed in both of them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on pull request #819: Support XML strings in XMLTextInfosetInputter/Outputter

Posted by GitBox <gi...@apache.org>.
stevedlawrence commented on PR #819:
URL: https://github.com/apache/daffodil/pull/819#issuecomment-1212145553

   I believe I've addressed all the comments. Please take another look


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org