Posted to dev@daffodil.apache.org by "Michael Beckerle (JIRA)" <ji...@apache.org> on 2017/11/09 14:38:00 UTC

[jira] [Commented] (DAFFODIL-1565) Parse/Unparse API Cursor-behavior/streaming enhancements

    [ https://issues.apache.org/jira/browse/DAFFODIL-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245743#comment-16245743 ] 

Michael Beckerle commented on DAFFODIL-1565:
--------------------------------------------


I was going to make a wiki page for a design discussion, but the markdown I created offline in emacs can't be imported to the Confluence wiki... so I'm just pasting it here.

= Message Streaming API =

XML documents always have a single root element. DFDL borrows heavily from the XML Schema data model, and so also shares this concept of single root element.

However, this concept does not match many use cases.

Consider an unbounded stream (think TCP connection or Java InputStream) of message data objects. Each message is relatively small. The stream is very long or even has no end to it. Let's assume the messages are of variable length, so separating them from each other requires parsing them.

A naive API would not return a parsed infoset until the end of the stream is reached, and the returned infoset would, in XML style, always have a single root document element (e.g., <Messages>...</Messages>) that surrounds the array of individual messages. Furthermore, one would usually check to ensure that the parse call has in fact consumed all the data; it would commonly be considered an error if the parser returned but there was unconsumed data left over.

A properly functioning streaming API can parse one message "off the front of the stream", leaving the stream positioned at the start of the next message.
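This "off the front of the stream" behavior can be modeled with plain Java streams, independent of Daffodil's API. A minimal sketch, assuming simple length-prefixed framing; `parseOne` and `frame` are hypothetical helpers for illustration, not Daffodil calls:

```scala
import java.io.{ByteArrayInputStream, DataInputStream, EOFException, InputStream}

// Each "message" here is a 2-byte big-endian length prefix followed by that
// many bytes of payload. parseOne consumes exactly one message off the front
// of the stream, leaving the stream positioned at the start of the next.
def parseOne(in: DataInputStream): Option[Array[Byte]] = {
  try {
    val len = in.readUnsignedShort()
    val buf = new Array[Byte](len)
    in.readFully(buf)
    Some(buf)
  } catch {
    case _: EOFException => None // clean end of stream
  }
}

def frame(payload: Array[Byte]): Array[Byte] =
  Array(((payload.length >> 8) & 0xff).toByte, (payload.length & 0xff).toByte) ++ payload

// Three variable-length messages concatenated into one stream.
val raw: InputStream = new ByteArrayInputStream(
  frame("ab".getBytes("UTF-8")) ++ frame("cdef".getBytes("UTF-8")) ++ frame("g".getBytes("UTF-8")))
val in = new DataInputStream(raw)

// Repeated calls each consume one message and leave the stream ready for the next.
val msgs: List[String] =
  Iterator.continually(parseOne(in)).takeWhile(_.isDefined).map(o => new String(o.get, "UTF-8")).toList
// msgs == List("ab", "cdef", "g")
```

The point of the sketch is only that each call consumes exactly one message's bytes; the real API would of course determine message boundaries by parsing against a DFDL schema rather than a length prefix.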



== Cursor API and CLI Behavior - No Single Root ==

The Daffodil API enables a cursor-like behavior. Requirements:

* onPath - Parsing does not require that an enveloping root element be part of the infoset. The ProcessorFactory.onPath(...) API call makes it possible to specify the path to an array element, such that each parse call consumes the data corresponding to a single message and returns an infoset for the array element corresponding to that message.

(TBD: onPath(...) might not want to be available only on ProcessorFactory. After reloading a saved processor, is it still sensible to call it? In general we want to eliminate the notion that a single root element is pre-identified before the data processor is even saved.)

* It is up to the application to decide whether left-over data is an error, or is expected behavior.

* The parser's state, including the input stream, should enable repeated calls to parse(). Each time parse() is called, another array element should be parsed, consuming data from the input stream and leaving the state prepared for another call to parse() to obtain yet another element.

* The parsing of an element of the array must be entirely independent of earlier array elements. No expressions can refer to earlier array elements. This allows all but one array element to be dropped from the infoset representation; the entire array need not be kept around. Removing them is called array pruning.

=== Parser Corner Case ===

* One might allow this dependence to involve only the immediately prior array element. That way, only the prior array element's infoset must be kept around for the next parse call. This issue comes up with the repeat-flags in the VMF message format: each array element examines a flag in the prior array element to determine its own existence. However, this doesn't occur at the level of whole VMF messages, only in the arrays of things deep within them.
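That repeat-flag dependence can be sketched in a few lines. `Elem` and `parseArray` below are hypothetical names modeling a VMF-style flag, not actual Daffodil or VMF schema code; the key property is that parsing element N+1 only consults element N's flag, so everything before element N can be pruned:

```scala
// Hypothetical VMF-style array: each element is (flag, value) where flag == 1
// means "another element follows". Parsing element N+1 only needs element N's
// flag, so all earlier elements can be pruned from the infoset.
final case class Elem(moreFollows: Boolean, value: Int)

def parseArray(data: List[Int]): List[Elem] = {
  def loop(rest: List[Int], acc: List[Elem]): List[Elem] = rest match {
    case flag :: value :: tail =>
      val e = Elem(flag == 1, value)
      if (e.moreFollows) loop(tail, e :: acc) else (e :: acc).reverse
    case _ => acc.reverse // ran out of data
  }
  loop(data, Nil)
}

val arr = parseArray(List(1, 10, 1, 20, 0, 30))
// arr == List(Elem(true, 10), Elem(true, 20), Elem(false, 30))
```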

If the messages are small, and their corresponding Infosets are small, this cursor-style API achieves the goal of parsing unbounded streams of small messages using only a small memory footprint.

* The Daffodil CLI has a matching mode where it implements this style of processing. The expected input is an unbounded stream of messages, and the expected output is an unbounded stream of XML elements (or other Infoset representation) with no surrounding XML document element. So long as the entire representation of a message is available on the input, it must be legal to block waiting for the parsed representation on the output. (The CLI will flush the output after each message.)

* The event-oriented equivalent of the above is that the InfosetOutputter events for an entire message must be called while consuming only the representation of a single message. If the stream provides all bytes of a single data message, but a subsequent call to read more would block, one must be able to depend on the InfosetOutputter's events for an entire message element all being called without a blocking read on the input data stream.

Corner cases:

* Data formats seen to date only ever have expressions that reach backward by 1 array element. However, it is theoretically possible for a data format to use expressions which reach backward more than one array element. Whether this is only a theoretical corner case, or there is an actual use case for it, is unclear. The use case where one cannot prune the arrays in the infoset is easily satisfied by a tunable which simply turns the pruning behavior off.

(Implementation Note: I would not implement this tunable - which requires documentation and test coverage - until a use case is clear. Rather, one should enable turning the pruning behavior on/off via internal flags for purposes of studying/documenting the impact it has on memory footprint. This mechanism could be turned into the tunable if a use case is found.)
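The event contract above can be modeled with a small sketch. The `Outputter` trait and `emitMessage` driver are hypothetical stand-ins loosely modeled on Daffodil's InfosetOutputter; the point is that all events for one message fire from data already in hand, with no further (possibly blocking) reads:

```scala
// Minimal event-style outputter sketch (hypothetical, not Daffodil's actual trait).
trait Outputter {
  def startElement(name: String): Unit
  def simpleValue(text: String): Unit
  def endElement(name: String): Unit
}

// Records events so we can observe that one message's worth of events fires
// per call, with nothing deferred until more input arrives.
final class RecordingOutputter extends Outputter {
  val events = scala.collection.mutable.ListBuffer[String]()
  def startElement(name: String): Unit = events += s"start:$name"
  def simpleValue(text: String): Unit  = events += s"text:$text"
  def endElement(name: String): Unit   = events += s"end:$name"
}

// Emits the complete event sequence for one message from bytes already read --
// no further input is consumed, so no read can block mid-message.
def emitMessage(payload: String, out: Outputter): Unit = {
  out.startElement("message")
  out.simpleValue(payload)
  out.endElement("message")
}

val rec = new RecordingOutputter
emitMessage("hello", rec)
// rec.events: start:message, text:hello, end:message
```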

For unparsing, the behavior is symmetric. Upon delivery of all the InfosetInputter events for a single message element, the unparse call should write (and flush) the entire representation of that element to the output stream.
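The symmetric unparse contract can be sketched the same way. `unparseOne` is a hypothetical stand-in (again using length-prefixed framing, not Daffodil's API) that writes one message's complete representation and flushes before returning, so a downstream consumer never waits on later messages:

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Each call writes one message's complete representation (2-byte big-endian
// length prefix + payload) and flushes, so the whole message is on the wire
// before the call returns.
def unparseOne(payload: Array[Byte], out: DataOutputStream): Unit = {
  out.writeShort(payload.length)
  out.write(payload)
  out.flush() // message is fully delivered; consumer need not wait for more
}

val sink = new ByteArrayOutputStream
val out = new DataOutputStream(sink)
List("ab", "cdef").foreach(s => unparseOne(s.getBytes("UTF-8"), out))
// sink now holds: 00 02 'a' 'b' 00 04 'c' 'd' 'e' 'f'  (10 bytes)
```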

=== Unparser Corner Case ===

* This can theoretically be off by one, in the sense that outputting the Nth message array element may require knowing whether the N+1th array element exists. An example of this is the repeating arrays in the MIL-STD-6017/VMF format, where a flag in each array element indicates whether it is the last one or there is yet another.

* This is a corner case because those kinds of array elements aren't used as the whole message/record element. At the level of whole individual VMF messages, there are no such flags nor expressions reaching across array element boundaries.

 

=== Precise I/O Stream Behavior ===

 

Note that for the message stream use case to work, the state of the parser (or unparser) must be maintained so that it is ready for the next parse (or unparse) call. The output stream must be flushed (for the unparser), and the InfosetOutputter events must also be flushed, meaning all events that can be generated are generated.

 

== Parsing ==

 

Suppose the XML logically looks like:

 

{code}
<items>
  <item>...</item>
  ...
</items>
{code}

 

But what we really have is a stream of individual items. We want the API to return the items one by one.

 

{code}
import java.io.InputStream
import scala.xml.Node

val pf: ProcessorFactory = ???

// create a data processor, but specify that we're
// bypassing the document element for individual item objects.
val dp: DataProcessor = pf.onPath("/item")

val is: InputStream = ??? // the raw data

val xmlOut = new ScalaXMLInfosetOutputter

//
// create a stateful StreamingParser object
//
// Must be used from only 1 thread.
//
val sp: StreamingParser = dp.streamingParser(is, xmlOut)

//
// Here's a very Scala way to use the streaming parser
//
def items: Stream[Node] = {
  if (sp.isEnd) {
    is.close()
    Stream.empty
  } else {
    val item = {
      sp.reset()
      sp.parse1() // a flurry of callbacks to the InfosetOutputter occur
      //
      // this particular outputter accumulates a Scala XML node
      // that we have to get out.
      //
      xmlOut.getResult()
    }
    item #:: items
  }
}

// Maybe you want an iterator for Java-style hasNext()/next() calling.
val iter: Iterator[Node] = items.toIterator
{code}

== Unparsing ==

For unparsing, the relationship is a little different.

 

{code}
import java.io.OutputStream
import scala.xml.Node

val os: OutputStream = ???

val xmlInp = new ScalaXMLInfosetInputter() // new API. Requires reset onto a node.

val sup: StreamingUnparser = dp.streamingUnparser(xmlInp, os)

def unparseItems(nodes: Stream[Node]): Unit = {
  var s = nodes
  var exit = false
  while (!s.isEmpty && !exit) {
    val node = s.head
    xmlInp.reset(node)
    sup.unparse1()
    if (sup.isError) {
      /* error processing */
      exit = true
    }
    sup.reset()
    os.flush() // optional
    s = s.tail // advance to the next node
  }
  os.close()
}
{code}

 

 

> Parse/Unparse API Cursor-behavior/streaming enhancements
> --------------------------------------------------------
>
>                 Key: DAFFODIL-1565
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-1565
>             Project: Daffodil
>          Issue Type: Improvement
>          Components: API, Performance
>            Reporter: Steve Lawrence
>             Fix For: deferred
>
>
> See review comment for context.
> A thought. This API eventually wants to be something we can call over and over, picking up where the prior unparse left off in the output. That might be a slightly different API, but in general we need to be able to leave the output in a state such that the next unparse call can continue where the previous one left off. Doing this at bit-position granularity might be too hard, but byte-position ought to be possible.
> That's not really about this API, which is symmetric with our parse API. It's just an additional requirement. (The parse API also needs this call-over-and-over capability.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)