Posted to commits@daffodil.apache.org by "Mike Beckerle (Jira)" <ji...@apache.org> on 2022/01/05 17:06:00 UTC
[jira] [Commented] (DAFFODIL-2619) Add InfosetInputter with minimal overhead

    [ https://issues.apache.org/jira/browse/DAFFODIL-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469435#comment-17469435 ] 

Mike Beckerle commented on DAFFODIL-2619:
-----------------------------------------

Bunch of thoughts here. Multiple comments to go through them.

We need to consider nextElementErd() (line 393 of InfosetInputter.scala).

This operation converts name+namespace strings into an ERD. This complex operation is needed only because of the serialized representation. (E.g., if we had our own serialized representation, the exact ERD would be precisely identified in it, eliminating this overhead.)

However, the nextElementErd() method depends on the TRD Stack. That stack is maintained by the unparser as it traverses around the DFDL schema runtime data structures. This is required because a local name+namespace can map to different ERDs depending on the context in the schema. (That's the current runtime1 algorithm. I am actually coming to the view that this whole TRD-stack runtime infrastructure in daffodil-runtime-1 should be a static data structure created by the schema compiler, i.e., pre-computed and stored as part of the ERD and TRD data structures. But I digress.)
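
To illustrate why the lookup needs context, here is a minimal Scala sketch. Everything in it is a hypothetical stand-in: the QName and Erd case classes, the children table, and this nextElementErd signature are assumptions for illustration, not Daffodil's actual classes. The real TRD stack also holds term runtime data rather than bare ERDs.

```scala
// Hypothetical stand-ins, not Daffodil's real runtime classes.
final case class QName(namespace: String, local: String)
final case class Erd(id: String, qname: QName)

object ContextLookup {
  val ns = "urn:example"
  val root      = Erd("root", QName(ns, "root"))
  val header    = Erd("root/header", QName(ns, "header"))
  val body      = Erd("root/body", QName(ns, "body"))
  // Two distinct ERDs that share the same name+namespace.
  val headerLen = Erd("root/header/length", QName(ns, "length"))
  val bodyLen   = Erd("root/body/length", QName(ns, "length"))

  // Assumed schema shape: which child ERDs are legal under each parent ERD.
  val children: Map[Erd, Seq[Erd]] = Map(
    root   -> Seq(header, body),
    header -> Seq(headerLen),
    body   -> Seq(bodyLen)
  )

  // Resolve a name against the top of the context stack (head of the list).
  def nextElementErd(context: List[Erd], name: QName): Option[Erd] =
    context.headOption.flatMap { parent =>
      children.getOrElse(parent, Nil).find(_.qname == name)
    }
}
```

In this sketch the same "length" name resolves to headerLen when header is on top of the stack and to bodyLen when body is on top, which is the per-event cost that events already carrying an exact ERD would avoid.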

So, I think one cannot easily just create the original unparser infoset events (which contain infoset DINodes) and subsequently feed them to the unparser. 

I think this will require two passes. The first pass is an unparse that captures a new kind of event object, which is NOT the same thing as the unparser's current infoset events.

A second pass uses an infoset inputter that expects this sequence of events and pulls them on demand, constructing the regular unparser InfosetEvent objects (which contain DINodes) and incrementally connecting the DINodes to form the Infoset tree.

One possible representation of these replayable infoset events would be:

For complex types - a start indicator and an ERD, an end indicator and an ERD

For simple types - an ERD and a typed value (that is, already converted from say, an XML string, into typed data.)

For arrays - a start indicator and an ERD, an end indicator and an ERD. 
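
As a rough sketch of that representation and of how a second pass might consume it, the Scala below models the three event shapes and replays them into a tree. All names here (Erd, ReplayEvent and its subclasses, Node, Replay.buildTree) are hypothetical; Daffodil's real ERD and DINode classes are much richer, and modelling the array as a node holding its occurrences is a simplifying assumption.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for an ERD; the real one carries far more information.
final case class Erd(name: String)

// One event per item in the list above; each carries its exact ERD.
sealed trait ReplayEvent { def erd: Erd }
final case class StartComplex(erd: Erd) extends ReplayEvent
final case class EndComplex(erd: Erd) extends ReplayEvent
final case class SimpleValue(erd: Erd, value: Any) extends ReplayEvent // value is already typed
final case class StartArray(erd: Erd) extends ReplayEvent
final case class EndArray(erd: Erd) extends ReplayEvent

// Simplified infoset node; children are attached as events are consumed.
final class Node(val erd: Erd, val value: Option[Any] = None) {
  val children = ArrayBuffer.empty[Node]
}

object Replay {
  // Pull events one at a time, connecting nodes incrementally.
  def buildTree(events: Iterator[ReplayEvent]): Node = {
    val top = new Node(Erd("<document>")) // artificial parent for the root element
    var stack = List(top)
    for (ev <- events) ev match {
      case _: StartComplex | _: StartArray =>
        val n = new Node(ev.erd)
        stack.head.children += n
        stack = n :: stack
      case _: EndComplex | _: EndArray =>
        require(stack.head.erd == ev.erd, s"unbalanced end event for ${ev.erd.name}")
        stack = stack.tail
      case SimpleValue(erd, v) =>
        stack.head.children += new Node(erd, Some(v))
    }
    require(stack.size == 1, "unclosed start event")
    top.children.head
  }
}
```

No name lookup happens during the replay: each event already names its ERD, so the only remaining work is linking nodes together.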

That will let us isolate the unparser from aspects of unparse overhead related to the serialization format. 

 

 

> Add InfosetInputter with minimal overhead
> -----------------------------------------
>
>                 Key: DAFFODIL-2619
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2619
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Performance
>            Reporter: Steve Lawrence
>            Assignee: Steve Lawrence
>            Priority: Major
>             Fix For: 3.3.0
>
>
> When unparsing, some amount of performance can be attributed to parsing/traversing the infoset representation (e.g. text, Scala, JDOM) and converting it to the infoset events that Daffodil requires. This is done via an InfosetInputter. Because an InfosetInputter is required during unparsing (unlike parsing, which can have a null InfosetOutputter) it is difficult to determine how much overhead comes from the InfosetInputter and how much comes from the actual unparsing operations.
> One potential option to get a better picture of raw unparse speed vs InfosetInputter overhead is to create a new InfosetInputter that uses a pre-created array of events that are as close as possible to what Daffodil expects. This way, the overhead of creating the event array can occur outside any measured unparser code, and the overhead of the InfosetInputter that does occur inside measured code is quite small--just the amount to index into this array and return the event information.
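
The array-backed idea in the issue description could look roughly like the Scala sketch below. It does not use Daffodil's real InfosetInputter API: the Event types, ArrayInputter, and the consume function standing in for the unparser are all hypothetical, chosen only to show where the measured region would begin.

```scala
// Hypothetical event type; Daffodil's real InfosetInputter API is richer than this.
sealed trait Event
final case class Start(name: String) extends Event
final case class End(name: String) extends Event
final case class Value(name: String, value: Any) extends Event

// Per-event work is limited to an array read and an index increment.
final class ArrayInputter(events: Array[Event]) {
  private var pos = 0
  def hasNext: Boolean = pos < events.length
  def next(): Event = {
    val e = events(pos)
    pos += 1
    e
  }
}

object ArrayInputterDemo {
  // Built up front, outside any timed region.
  val events: Array[Event] = Array(Start("record"), Value("id", 42), End("record"))

  // Stand-in for the unparser: drains the inputter and counts the events.
  def consume(in: ArrayInputter): Int = {
    var n = 0
    while (in.hasNext) { in.next(); n += 1 }
    n
  }

  // Only the consumption is timed; event creation has already happened.
  def timedConsume(): (Int, Long) = {
    val in = new ArrayInputter(events)
    val t0 = System.nanoTime()
    val n = consume(in)
    (n, System.nanoTime() - t0)
  }
}
```

A single System.nanoTime() measurement like this is only illustrative; a real comparison would need warm-up and repeated runs.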


