Posted to users@daffodil.apache.org by Roger L Costello <co...@mitre.org> on 2023/05/17 12:35:21 UTC

The performance of Daffodil at the command line is horrible

The input file is 375 MB
The XML file that DFDL parsing generates is 4.67 GB

Time required for Daffodil to parse the input and generate the XML file is 16 minutes, 24 seconds.

Ugh!

That is too long. My customers will laugh at me if I suggest they use a tool that takes 16 minutes to parse their data.

Below is the skeletal structure of my DFDL schema. I am pretty sure the "choice" is the cause of the slowness. I don't see an alternative to the choice; each record of the input could be one of the choices (i.e., the input records aren't in any order). Any suggestions for improving the performance?

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
    
    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:format
                alignment="1" 
                alignmentUnits="bytes" 
                choiceLengthKind="implicit"
                emptyValueDelimiterPolicy="none" 
                encoding="ASCII" 
                encodingErrorPolicy="replace" 
                escapeSchemeRef="" 
                fillByte="%SP;" 
                floating="no" 
                ignoreCase="yes" 
                initiatedContent="no" 
                initiator="" 
                leadingSkip="0"
                lengthKind="delimited" 
                lengthUnits="characters" 
                nilValueDelimiterPolicy="none" 
                occursCountKind="implicit" 
                outputNewLine="%CR;%LF;" 
                representation="text" 
                separator="" 
                separatorSuppressionPolicy="anyEmpty" 
                sequenceKind="ordered" 
                textBidi="no" 
                textPadKind="none"
                textTrimKind="none" 
                trailingSkip="0" 
                truncateSpecifiedLengthString="no" 
                terminator="" 
                textNumberRep="standard" 
                textStandardBase="10" 
                textStandardZeroRep="0" 
                textNumberRounding="pattern" 
                textStandardExponentRep="E" 
                textNumberCheckPolicy="strict"
            />
        </xs:appinfo>
    </xs:annotation>
    
    <xs:element name="Test">
        <xs:complexType>
            <xs:sequence dfdl:separator="%NL;" dfdl:separatorPosition="infix">
                <xs:element name="record" maxOccurs="unbounded" >
                    <xs:complexType>
                        <xs:choice>
                            <xs:element ref="A" />                                        
                            <xs:element ref="B" />                                        
                            <xs:element ref="C" />                                              
                            <xs:element ref="D" />                                        
                            <!-- A hundred more of these element ref's -->
                        </xs:choice>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

Re: The performance of Daffodil at the command line is horrible

Posted by Mike Beckerle <mb...@apache.org>.

The choice is certainly the likely suspect.

What you have here is an O(n * m) algorithm, where n is the number of
records and m is the number of record types.

So, how does the format determine which record type (A, B, C, ...) is the
one in the data?

Most formats will have one or a small handful of different criteria used,
based on common initial parts of the data stream.

The secret is to capture those in exactly one place in the schema and
expose it before the choice, so that the choice can exploit that common
structure.
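
To make the cost difference concrete, here is a minimal Python sketch. It is not DFDL and says nothing about how Daffodil is implemented internally; the branch parsers, the one-character type code, and all names are hypothetical stand-ins. It only counts how many branches get tried per record under each strategy.

```python
# Hypothetical illustration only: none of these names come from Daffodil or DFDL.

def make_branch(code):
    """Build a stand-in branch parser that accepts only records starting with `code`."""
    def parse(record):
        if not record.startswith(code):
            raise ValueError("branch does not match")
        return (code, record[len(code):])
    return parse

# m record types, keyed by an assumed leading type code.
CODES = ["A", "B", "C", "D"]
BRANCHES = [make_branch(c) for c in CODES]
DISPATCH = dict(zip(CODES, BRANCHES))

def parse_by_trial(record):
    """Plain choice: try each branch in turn, moving on after a failure (up to m attempts)."""
    attempts = 0
    for branch in BRANCHES:
        attempts += 1
        try:
            return branch(record), attempts
        except ValueError:
            continue
    raise ValueError("no branch matched")

def parse_by_dispatch(record):
    """Dispatch: read the discriminating field once, then go straight to one branch."""
    key = record[0]
    return DISPATCH[key](record), 1

records = ["Dxyz", "Apqr", "Cfoo"]
trial_attempts = sum(parse_by_trial(r)[1] for r in records)
dispatch_attempts = sum(parse_by_dispatch(r)[1] for r in records)
print(trial_attempts, dispatch_attempts)
```

With a hundred record types instead of four, the trial loop does proportionally more work per record, while the dispatch path stays at one lookup.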



On Wed, May 17, 2023 at 8:35 AM Roger L Costello <co...@mitre.org> wrote:

> The input file is 375 MB
> The XML file that DFDL parsing generates is 4.67 GB
>
> Time required for Daffodil to parse the input and generate the XML file is
> 16 minutes, 24 seconds.
>
> Ugh!
>
> That is too long. My customers will laugh at me if I suggest they use a
> tool that takes 16 minutes to parse their data.
>
> Below is the skeletal structure of my DFDL schema. I am pretty sure the
> "choice" is the cause of the slowness. I don't see an alternative to the
> choice; each record of the input could be one of the choices (i.e., the
> input records aren't in any order). Any suggestions for improving the
> performance?
>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>     xmlns:fn="http://www.w3.org/2005/xpath-functions"
>     xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
>
>     <xs:annotation>
>         <xs:appinfo source="http://www.ogf.org/dfdl/">
>             <dfdl:format
>                 alignment="1"
>                 alignmentUnits="bytes"
>                 choiceLengthKind="implicit"
>                 emptyValueDelimiterPolicy="none"
>                 encoding="ASCII"
>                 encodingErrorPolicy="replace"
>                 escapeSchemeRef=""
>                 fillByte="%SP;"
>                 floating="no"
>                 ignoreCase="yes"
>                 initiatedContent="no"
>                 initiator=""
>                 leadingSkip="0"
>                 lengthKind="delimited"
>                 lengthUnits="characters"
>                 nilValueDelimiterPolicy="none"
>                 occursCountKind="implicit"
>                 outputNewLine="%CR;%LF;"
>                 representation="text"
>                 separator=""
>                 separatorSuppressionPolicy="anyEmpty"
>                 sequenceKind="ordered"
>                 textBidi="no"
>                 textPadKind="none"
>                 textTrimKind="none"
>                 trailingSkip="0"
>                 truncateSpecifiedLengthString="no"
>                 terminator=""
>                 textNumberRep="standard"
>                 textStandardBase="10"
>                 textStandardZeroRep="0"
>                 textNumberRounding="pattern"
>                 textStandardExponentRep="E"
>                 textNumberCheckPolicy="strict"
>             />
>         </xs:appinfo>
>     </xs:annotation>
>
>     <xs:element name="Test">
>         <xs:complexType>
>             <xs:sequence dfdl:separator="%NL;"
> dfdl:separatorPosition="infix">
>                 <xs:element name="record" maxOccurs="unbounded" >
>                     <xs:complexType>
>                         <xs:choice>
>                             <xs:element ref="A" />
>
>                             <xs:element ref="B" />
>
>                             <xs:element ref="C" />
>
>                             <xs:element ref="D" />
>
>                             <!-- A hundred more of these element ref's -->
>                         </xs:choice>
>                     </xs:complexType>
>                 </xs:element>
>             </xs:sequence>
>         </xs:complexType>
>     </xs:element>
>

Re: A deadline for Drill + Daffodil Integration - ApacheCon in Oct.

Posted by Charles Givre <cg...@gmail.com>.

Concur +1 From me!
-- C



> On Jul 6, 2023, at 12:46 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> That's a cool abstract.
> 
> 
> 
> On Thu, Jul 6, 2023 at 8:29 AM Mike Beckerle <mb...@apache.org> wrote:
> 
>> I decided the only way to force getting this Drill + Daffodil integration
>> done, or at least started, is to have a deadline.
>> 
>> So I submitted this abstract below for the upcoming "Community over Code"
>> (formerly known as ApacheCon) conference this fall (Oct 7-10)
>> 
>> I'm hoping this forces some of the refactoring that is gating other efforts
>> and fixes in Daffodil at the same time.
>> 
>> *Direct Query of Arbitrary Data Formats using Apache Drill and Apache
>> Daffodil*
>> 
>> 
>> *Suppose you have data in an ad-hoc data format like **EDIFACT, ISO8583,
>> Asterix, some COBOL FD, or any other kind of data. **You can now describe
>> it with a Data Format Description Language (DFDL) schema, then u**sing
>> Apache Drill, you can directly query that data and those queries can also
>> incorporate data from any of Apache Drill's other array of data sources.
>> **This
>> talk will describe the integration of Apache Drill with Apache Daffodil's
>> DFDL implementation. **This deep integration implements Drill's metadata
>> model in terms of the Daffodil DFDL metadata model, and implements Drill's
>> data model in terms of the Daffodil DFDL Infoset API. This enables Drill
>> queries to operate intelligently on DFDL-described data without the cost of
>> data conversion into an expensive intermediate form like JSON or XML. **The
>> talk will highlight the specific challenges in this integration and the
>> lessons learned that are applicable to integration of other Apache projects
>> having their own metadata and data models. *
>>

Re: A deadline for Drill + Daffodil Integration - ApacheCon in Oct.

Posted by Mike Beckerle <mb...@apache.org>.

Thanks. I made it up myself. Didn't even use ChatGPT :-)

On Thu, Jul 6, 2023 at 12:46 PM Ted Dunning <te...@gmail.com> wrote:

> That's a cool abstract.
>
>
>
> On Thu, Jul 6, 2023 at 8:29 AM Mike Beckerle <mb...@apache.org> wrote:
>
> > I decided the only way to force getting this Drill + Daffodil integration
> > done, or at least started, is to have a deadline.
> >
> > So I submitted this abstract below for the upcoming "Community over Code"
> > (formerly known as ApacheCon) conference this fall (Oct 7-10)
> >
> > I'm hoping this forces some of the refactoring that is gating other
> efforts
> > and fixes in Daffodil at the same time.
> >
> > *Direct Query of Arbitrary Data Formats using Apache Drill and Apache
> > Daffodil*
> >
> >
> > *Suppose you have data in an ad-hoc data format like **EDIFACT, ISO8583,
> > Asterix, some COBOL FD, or any other kind of data. **You can now describe
> > it with a Data Format Description Language (DFDL) schema, then u**sing
> > Apache Drill, you can directly query that data and those queries can also
> > incorporate data from any of Apache Drill's other array of data sources.
> > **This
> > talk will describe the integration of Apache Drill with Apache Daffodil's
> > DFDL implementation. **This deep integration implements Drill's metadata
> > model in terms of the Daffodil DFDL metadata model, and implements
> Drill's
> > data model in terms of the Daffodil DFDL Infoset API. This enables Drill
> > queries to operate intelligently on DFDL-described data without the cost
> of
> > data conversion into an expensive intermediate form like JSON or XML.
> **The
> > talk will highlight the specific challenges in this integration and the
> > lessons learned that are applicable to integration of other Apache
> projects
> > having their own metadata and data models. *
> >
>

Re: A deadline for Drill + Daffodil Integration - ApacheCon in Oct.

Posted by Ted Dunning <te...@gmail.com>.

That's a cool abstract.



On Thu, Jul 6, 2023 at 8:29 AM Mike Beckerle <mb...@apache.org> wrote:

> I decided the only way to force getting this Drill + Daffodil integration
> done, or at least started, is to have a deadline.
>
> So I submitted this abstract below for the upcoming "Community over Code"
> (formerly known as ApacheCon) conference this fall (Oct 7-10)
>
> I'm hoping this forces some of the refactoring that is gating other efforts
> and fixes in Daffodil at the same time.
>
> *Direct Query of Arbitrary Data Formats using Apache Drill and Apache
> Daffodil*
>
>
> *Suppose you have data in an ad-hoc data format like **EDIFACT, ISO8583,
> Asterix, some COBOL FD, or any other kind of data. **You can now describe
> it with a Data Format Description Language (DFDL) schema, then u**sing
> Apache Drill, you can directly query that data and those queries can also
> incorporate data from any of Apache Drill's other array of data sources.
> **This
> talk will describe the integration of Apache Drill with Apache Daffodil's
> DFDL implementation. **This deep integration implements Drill's metadata
> model in terms of the Daffodil DFDL metadata model, and implements Drill's
> data model in terms of the Daffodil DFDL Infoset API. This enables Drill
> queries to operate intelligently on DFDL-described data without the cost of
> data conversion into an expensive intermediate form like JSON or XML. **The
> talk will highlight the specific challenges in this integration and the
> lessons learned that are applicable to integration of other Apache projects
> having their own metadata and data models. *
>

A deadline for Drill + Daffodil Integration - ApacheCon in Oct.

Posted by Mike Beckerle <mb...@apache.org>.

I decided the only way to force getting this Drill + Daffodil integration
done, or at least started, is to have a deadline.

So I submitted this abstract below for the upcoming "Community over Code"
(formerly known as ApacheCon) conference this fall (Oct 7-10)

I'm hoping this forces some of the refactoring that is gating other efforts
and fixes in Daffodil at the same time.

*Direct Query of Arbitrary Data Formats using Apache Drill and Apache
Daffodil*


*Suppose you have data in an ad-hoc data format like **EDIFACT, ISO8583,
Asterix, some COBOL FD, or any other kind of data. **You can now describe
it with a Data Format Description Language (DFDL) schema, then u**sing
Apache Drill, you can directly query that data and those queries can also
incorporate data from any of Apache Drill's other array of data sources. **This
talk will describe the integration of Apache Drill with Apache Daffodil's
DFDL implementation. **This deep integration implements Drill's metadata
model in terms of the Daffodil DFDL metadata model, and implements Drill's
data model in terms of the Daffodil DFDL Infoset API. This enables Drill
queries to operate intelligently on DFDL-described data without the cost of
data conversion into an expensive intermediate form like JSON or XML. **The
talk will highlight the specific challenges in this integration and the
lessons learned that are applicable to integration of other Apache projects
having their own metadata and data models. *

Re: The performance of Daffodil at the command line is horrible

Posted by Charles Givre <cg...@gmail.com>.

Hello all, 
To weigh in here... this would be a great use case for a partnership with Apache Drill.  Drill can read XML natively but not always accurately.  Using DFDL to provide a schema for Drill would be a HUGE win.  If anyone is interested in revisiting that thread, I'd be happy to resume the conversation.
Best,
-- C



> On May 17, 2023, at 2:02 PM, Mike Beckerle <mb...@apache.org> wrote:
> 
> How do you know if it is character 6 or character 13 that has the subsection code? I assume that depends on the character 5 section code?
> 
> What is in characters 1-4 and 6-12 ? Different for every record type?
> 
> There is a pure-DFDL answer to this which I don't have enough info yet to explain, and there is a Daffodil extension, the dfdlx:lookAhead() function. The latter is obvious how to use. You look ahead at characters 5, 6, and 13, then convert your choice into a 'choice-by-dispatch' which is constant time, not O(m) time. 
> 
> https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+DFDLX+lookAhead
> 
> This stuff comes up often enough that I'm thinking about a layer to let you easily examine a part of the data stream twice - once to learn from it, a second time to actually parse it. In your case you want to examine bytes 1 to 13 twice. Once to learn the section code and subsection code, a second time when actually parsing the message. 
> 
> 
> 
> 
> 
> 
> On Wed, May 17, 2023 at 9:42 AM Roger L Costello <costello@mitre.org> wrote:
>> Hi Mike,
>> 
>>  
>> 
>> how does the format determine which record type, A, B, C, .... is the one in the data?
>>  
>> 
>> The input consists of lines. Each line is exactly 132 characters.
>> 
>>  
>> 
>> The type of a line is determined by a 1-character section code plus a 1-character subsection code. The section code is always located at character 5. The subsection code is always located either at character 6 or at character 13. Given that, how would I modify my DFDL schema to improve its performance?
>> 
>>  
>> 
>> From: Mike Beckerle <mbeckerle@apache.org>
>> Sent: Wednesday, May 17, 2023 9:13 AM
>> To: users@daffodil.apache.org
>> Subject: [EXT] Re: The performance of Daffodil at the command line is horrible
>> 
>>  
>> 
>>  
>> 
>> The choice is certainly the likely suspect.
>> 
>>  
>> 
>> What you have here is an O(n * m) algorithm where n is how many records and m is the number of record types. 
>> 
>>  
>> 
>> So, how does the format determine which record type, A, B, C, .... is the one in the data? 
>> 
>>  
>> 
>> Most formats will have one or a small handful of different criteria used, based on common initial parts of the data stream.
>> 
>>  
>> 
>> The secret is to capture those in exactly one place in the schema and expose it before the choice, so that the choice can exploit that common structure. 
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> On Wed, May 17, 2023 at 8:35 AM Roger L Costello <costello@mitre.org> wrote:
>> 
>> The input file is 375 MB
>> The XML file that DFDL parsing generates is 4.67 GB
>> 
>> Time required for Daffodil to parse the input and generate the XML file is 16 minutes, 24 seconds.
>> 
>> Ugh!
>> 
>> That is too long. My customers will laugh at me if I suggest they use a tool that takes 16 minutes to parse their data.
>> 
>> Below is the skeletal structure of my DFDL schema. I am pretty sure the "choice" is the cause of the slowness. I don't see an alternative to the choice; each record of the input could be one of the choices (i.e., the input records aren't in any order). Any suggestions for improving the performance?
>> 
>> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>>     xmlns:fn="http://www.w3.org/2005/xpath-functions"
>>     xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
>> 
>>     <xs:annotation>
>>         <xs:appinfo source="http://www.ogf.org/dfdl/">
>>             <dfdl:format
>>                 alignment="1" 
>>                 alignmentUnits="bytes" 
>>                 choiceLengthKind="implicit"
>>                 emptyValueDelimiterPolicy="none" 
>>                 encoding="ASCII" 
>>                 encodingErrorPolicy="replace" 
>>                 escapeSchemeRef="" 
>>                 fillByte="%SP;" 
>>                 floating="no" 
>>                 ignoreCase="yes" 
>>                 initiatedContent="no" 
>>                 initiator="" 
>>                 leadingSkip="0"
>>                 lengthKind="delimited" 
>>                 lengthUnits="characters" 
>>                 nilValueDelimiterPolicy="none" 
>>                 occursCountKind="implicit" 
>>                 outputNewLine="%CR;%LF;" 
>>                 representation="text" 
>>                 separator="" 
>>                 separatorSuppressionPolicy="anyEmpty" 
>>                 sequenceKind="ordered" 
>>                 textBidi="no" 
>>                 textPadKind="none"
>>                 textTrimKind="none" 
>>                 trailingSkip="0" 
>>                 truncateSpecifiedLengthString="no" 
>>                 terminator="" 
>>                 textNumberRep="standard" 
>>                 textStandardBase="10" 
>>                 textStandardZeroRep="0" 
>>                 textNumberRounding="pattern" 
>>                 textStandardExponentRep="E" 
>>                 textNumberCheckPolicy="strict"
>>             />
>>         </xs:appinfo>
>>     </xs:annotation>
>> 
>>     <xs:element name="Test">
>>         <xs:complexType>
>>             <xs:sequence dfdl:separator="%NL;" dfdl:separatorPosition="infix">
>>                 <xs:element name="record" maxOccurs="unbounded" >
>>                     <xs:complexType>
>>                         <xs:choice>
>>                             <xs:element ref="A" />                                        
>>                             <xs:element ref="B" />                                        
>>                             <xs:element ref="C" />                                              
>>                             <xs:element ref="D" />                                        
>>                             <!-- A hundred more of these element ref's -->
>>                         </xs:choice>
>>                     </xs:complexType>
>>                 </xs:element>
>>             </xs:sequence>
>>         </xs:complexType>
>>     </xs:element>
>>

Re: The performance of Daffodil at the command line is horrible

Posted by Mike Beckerle <mb...@apache.org>.

How do you know whether it is character 6 or character 13 that has the
subsection code? I assume that depends on the character 5 section code?

What is in characters 1-4 and 6-12? Is that different for every record type?

There is a pure-DFDL answer to this, which I don't have enough info yet to
explain, and there is a Daffodil extension, the dfdlx:lookAhead() function.
How to use the latter is obvious: you look ahead at characters 5, 6, and
13, then convert your choice into a 'choice-by-dispatch', which takes
constant time, not O(m) time.

https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+DFDLX+lookAhead

This stuff comes up often enough that I'm thinking about a layer to let you
easily examine a part of the data stream twice: once to learn from it, and a
second time to actually parse it. In your case you want to examine bytes 1
to 13 twice, once to learn the section code and subsection code, and a
second time when actually parsing the message.
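
The "examine twice" idea can be sketched in a few lines of Python. This is not the dfdlx:lookAhead() API (see the proposal page linked above for that), and the Cursor class, its method names, and the rule for which section codes put the subsection code at character 13 are all assumptions made for illustration.

```python
# Hypothetical sketch: peek at fixed positions first, then parse the line normally.
# Not the dfdlx:lookAhead() API; names and the section-code rule are assumptions.

class Cursor:
    def __init__(self, data):
        self.data = data
        self.pos = 0

    def peek(self, char_number):
        """Return the character at a 1-based position from the cursor, without consuming."""
        return self.data[self.pos + char_number - 1]

    def read(self, length):
        """Consume and return `length` characters."""
        chunk = self.data[self.pos:self.pos + length]
        self.pos += length
        return chunk

# Assumption: section codes listed here carry their subsection code at character 13;
# all other section codes carry it at character 6.
SUBSECTION_AT_13 = {"D"}

def dispatch_key(cursor):
    section = cursor.peek(5)
    subsection = cursor.peek(13 if section in SUBSECTION_AT_13 else 6)
    return section + subsection

line = "0001PA".ljust(132)
cursor = Cursor(line)
key = dispatch_key(cursor)   # first look: learn the record type, position unchanged
record = cursor.read(132)    # second look: parse the whole line from character 1
print(key, cursor.pos)
```

The point of the sketch is that the first look leaves the position untouched, so the branch chosen by the key still sees characters 1 to 132 in full.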






On Wed, May 17, 2023 at 9:42 AM Roger L Costello <co...@mitre.org> wrote:

> Hi Mike,
>
>
>
>    - how does the format determine which record type, A, B, C, .... is
>    the one in the data?
>
>
>
> The input consists of lines. Each line is exactly 132 characters.
>
>
>
> The type of a line is determined by a 1-character section code plus a
> 1-character subsection code. The section code is always located at
> character 5. The subsection code is always located either at character 6 or
> at character 13. Given that, how would I modify my DFDL schema to improve
> its performance?
>
>
>
> *From:* Mike Beckerle <mb...@apache.org>
> *Sent:* Wednesday, May 17, 2023 9:13 AM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: The performance of Daffodil at the command line is
> horrible
>
>
>
>
>
> The choice is certainly the likely suspect.
>
>
>
> What you have here is an O(n * m) algorithm where n is how many records
> and m is the number of record types.
>
>
>
> So, how does the format determine which record type, A, B, C, .... is the
> one in the data?
>
>
>
> Most formats will have one or a small handful of different criteria used,
> based on common initial parts of the data stream.
>
>
>
> The secret is to capture those in exactly one place in the schema and
> expose it before the choice, so that the choice can exploit that common
> structure.
>
>
>
>
>
>
>
> On Wed, May 17, 2023 at 8:35 AM Roger L Costello <co...@mitre.org>
> wrote:
>
> The input file is 375 MB
> The XML file that DFDL parsing generates is 4.67 GB
>
> Time required for Daffodil to parse the input and generate the XML file is
> 16 minutes, 24 seconds.
>
> Ugh!
>
> That is too long. My customers will laugh at me if I suggest they use a
> tool that takes 16 minutes to parse their data.
>
> Below is the skeletal structure of my DFDL schema. I am pretty sure the
> "choice" is the cause of the slowness. I don't see an alternative to the
> choice; each record of the input could be one of the choices (i.e., the
> input records aren't in any order). Any suggestions for improving the
> performance?
>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>     xmlns:fn="http://www.w3.org/2005/xpath-functions"
>     xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
>
>     <xs:annotation>
>         <xs:appinfo source="http://www.ogf.org/dfdl/">
>             <dfdl:format
>                 alignment="1"
>                 alignmentUnits="bytes"
>                 choiceLengthKind="implicit"
>                 emptyValueDelimiterPolicy="none"
>                 encoding="ASCII"
>                 encodingErrorPolicy="replace"
>                 escapeSchemeRef=""
>                 fillByte="%SP;"
>                 floating="no"
>                 ignoreCase="yes"
>                 initiatedContent="no"
>                 initiator=""
>                 leadingSkip="0"
>                 lengthKind="delimited"
>                 lengthUnits="characters"
>                 nilValueDelimiterPolicy="none"
>                 occursCountKind="implicit"
>                 outputNewLine="%CR;%LF;"
>                 representation="text"
>                 separator=""
>                 separatorSuppressionPolicy="anyEmpty"
>                 sequenceKind="ordered"
>                 textBidi="no"
>                 textPadKind="none"
>                 textTrimKind="none"
>                 trailingSkip="0"
>                 truncateSpecifiedLengthString="no"
>                 terminator=""
>                 textNumberRep="standard"
>                 textStandardBase="10"
>                 textStandardZeroRep="0"
>                 textNumberRounding="pattern"
>                 textStandardExponentRep="E"
>                 textNumberCheckPolicy="strict"
>             />
>         </xs:appinfo>
>     </xs:annotation>
>
>     <xs:element name="Test">
>         <xs:complexType>
>             <xs:sequence dfdl:separator="%NL;"
> dfdl:separatorPosition="infix">
>                 <xs:element name="record" maxOccurs="unbounded" >
>                     <xs:complexType>
>                         <xs:choice>
>                             <xs:element ref="A" />
>
>                             <xs:element ref="B" />
>
>                             <xs:element ref="C" />
>
>                             <xs:element ref="D" />
>
>                             <!-- A hundred more of these element ref's -->
>                         </xs:choice>
>                     </xs:complexType>
>                 </xs:element>
>             </xs:sequence>
>         </xs:complexType>
>     </xs:element>
>
>

Re: The performance of Daffodil at the command line is horrible

Posted by Mike Beckerle <mb...@apache.org>.

OK, so lookahead is very painful in DFDL.

In plain DFDL the only lookahead feature is dfdl:assert or
dfdl:discriminator with testKind 'pattern'. This peeks at the data stream
to see if the regex matches at the current position.

That won't work here because you need to look at two places, neither of
which is at the start of the data stream, and you need to dispatch on their
values, not just learn true/false about whether they have a specific value.

That's why the dfdlx:lookAhead() function was added to Daffodil as an
experimental extension.

But it's a hack, doesn't work well for text, and we really want something
easier to use.

But first, how to get by without any lookahead feature.

You can capture the P vs. D distinction first like this:

<element name="prefix" type="xs:string" dfdl:length="4"/> <!-- first 4 chars -->
<element name="code" type="xs:string" dfdl:length="1"/>
<choice dfdl:choiceDispatchKey='{ code }'>
  <sequence dfdl:choiceBranchKey="P">
    ....
  </sequence>
  <sequence dfdl:choiceBranchKey="D">
    ...
  </sequence>
</choice>

After that consider the sequence for the P records.

<element name="subcode" type="xs:string" dfdl:length="1"/> <!-- char 6, which is the subcode for P records -->
<choice dfdl:choiceDispatchKey='{ subcode }'>
  <!-- ... a bunch of choice branches with dfdl:choiceBranchKey values for all
       the P record subcodes; these handle all fields in chars 7 to 132 ... -->
</choice>

And for the D records:

<element name="moreChars" type="xs:string" dfdl:length="7"/>
<element name="subcode" type="xs:string" dfdl:length="1"/> <!-- char 13, which is the subcode for D records -->
<choice dfdl:choiceDispatchKey='{ subcode }'>
  <!-- ... a bunch of choice branches with dfdl:choiceBranchKey values for all
       the D record subcodes; these handle all fields in chars 14 to 132 ... -->
</choice>
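
To make the control flow of this two-level dispatch concrete, here is a rough Python model of it. It is not DFDL and says nothing about how Daffodil is implemented. The record type names in the table and the sample lines are made up for illustration; only the positions (section code at char 5, subcode at char 6 for P records and char 13 for D records) come from the discussion above.

```python
# Hypothetical branch table: (section code, subcode) -> record type name.
# The names are invented for illustration only.
BRANCHES = {
    ("P", "A"): "PA_record",
    ("P", "B"): "PB_record",
    ("D", "A"): "DA_record",
    ("D", "B"): "DB_record",
}

# 1-based character position of the subcode for each section code.
SUBCODE_POSITION = {"P": 6, "D": 13}


def dispatch(line: str) -> str:
    """Pick the record type by reading the key characters directly."""
    code = line[4]                    # char 5 (1-based) is the section code
    pos = SUBCODE_POSITION[code]      # first dispatch: where is the subcode?
    subcode = line[pos - 1]           # 1-based position -> 0-based index
    return BRANCHES[(code, subcode)]  # second dispatch: one dictionary lookup


def make_line(code: str, subcode: str) -> str:
    """Build a 132-character sample line with the given codes in place."""
    chars = ["x"] * 132
    chars[4] = code
    chars[SUBCODE_POSITION[code] - 1] = subcode
    return "".join(chars)
```

Each record costs two lookups regardless of how many record types exist, which is the property the choiceDispatchKey/choiceBranchKey pair is meant to give you.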

OK, but there's the nasty issue of the fields within the data we absorbed
into the prefix and moreChars elements. Those were just parsed generically
as strings above.
What if they actually have sub-elements within them?

Any sub-elements within the prefix have to be pulled out by parsing them
manually.

By manually I mean via substringing, i.e., like this. Suppose that for one
of the P records the prefix contains two fields, both integers: one is the
first 2 chars, the other is the next 2 chars.
<element name="num1" type="xs:int" dfdl:inputValueCalc='{ xs:int( fn:substring( ../prefix, 1, 2) ) }'/>
<element name="num2" type="xs:int" dfdl:inputValueCalc='{ xs:int( fn:substring( ../prefix, 3, 2) ) }'/>

We are using substring to parse into the prefix element. This is blecky but
it's only 4 chars, so how bad can it be?
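
One thing to watch with the substring approach is that fn:substring counts positions from 1, not 0. The following Python sketch mimics the two inputValueCalc expressions above for the simple case of positive integer arguments. The helper name and the sample prefix value are assumptions made for illustration.

```python
def xpath_substring(s: str, start: int, length: int) -> str:
    """Mimic fn:substring(s, start, length) for integer start >= 1.

    XPath positions are 1-based, unlike Python's 0-based slices.
    """
    return s[start - 1 : start - 1 + length]


prefix = "0742"  # hypothetical 4-character prefix value

num1 = int(xpath_substring(prefix, 1, 2))  # chars 1-2
num2 = int(xpath_substring(prefix, 3, 2))  # chars 3-4
```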

The D records are similar, except you have both prefix for chars 1 to 4
and another element, moreChars, with chars 6-12, which also has to be parsed
via the substring hack.

It's blecky, but it can work. In your case it's a grand total of 11
characters of data (4 from prefix, 7 from moreChars) that you have to cope
with this way.

Combine these techniques and you get O(1) dispatch to the proper record
definition for each record. That should solve your performance problem, or
at least knock it from O(n * m) to O(n), where n is the number of records
and m is the number of record types.
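
As a back-of-the-envelope illustration of the O(n * m) versus O(n) difference, the Python sketch below counts how many branches get tried per record under each strategy. It is only a model: the record type names and counts are made up, and a real failed branch attempt costs more than one comparison, since the parser may get some way into a branch before backtracking.

```python
import random


def sequential_attempts(keys, branch_keys):
    """Count branch attempts when each record tries the branches in order."""
    attempts = 0
    for key in keys:
        for branch in branch_keys:
            attempts += 1
            if branch == key:
                break
    return attempts


def dispatch_attempts(keys, branch_keys):
    """Count branch attempts when each record is dispatched by its key."""
    table = {branch: index for index, branch in enumerate(branch_keys)}
    attempts = 0
    for key in keys:
        _ = table[key]  # one lookup selects the branch directly
        attempts += 1
    return attempts


branch_keys = [f"T{i}" for i in range(100)]  # m = 100 hypothetical record types
rng = random.Random(0)
keys = [rng.choice(branch_keys) for _ in range(10_000)]  # n = 10,000 records

print("sequential:", sequential_attempts(keys, branch_keys))
print("dispatch:  ", dispatch_attempts(keys, branch_keys))
```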

Expressing this is clunky enough that it clearly shows the need for a
better way of doing short lookaheads into the data, one more powerful than
just dfdl:assert with testKind 'pattern'.

Based on this use case, what I want in DFDL someday is something more like
this:

<element name="preScan" dfdl:length="13" dfdl:lookahead="true"> <!-- new property dfdl:lookahead='true' -->
  <sequence dfdl:leadingSkip="4">
    <element name="code" type="xs:string" dfdl:length="1"/> <!-- byte 5 -->
    <element name="subcode1" type="xs:string" dfdl:length="1"/> <!-- byte 6 -->
    <sequence dfdl:leadingSkip="6"/>
    <element name="subcode2" type="xs:string" dfdl:length="1"/> <!-- byte 13 -->
  </sequence>
</element>
<choice dfdl:choiceDispatchKey='{ fn:concat(preScan/code, ( if (preScan/code eq "P") then preScan/subcode1 else preScan/subcode2) ) }'>
  <element dfdl:choiceBranchKey="PA" .../>
  <element dfdl:choiceBranchKey="PB" .../>
  ....
  <element dfdl:choiceBranchKey="DA" .../>
  <element dfdl:choiceBranchKey="DB" .../>
  ...
</choice>

The idea here is that the preScan element is populated by parsing the first
13 chars, but then the position is reset to just before it (which is what
dfdl:lookahead="true" is intended to mean), so it consumes zero bits of the
data stream.
I think the dfdl:lookahead feature would have to require the element to be
fixed length, but that would still handle all the use cases I know of.
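
The intended behavior of this hypothetical dfdl:lookahead property can be modeled in Python as a peek that reads 13 characters and then restores the stream position. This is only a sketch of the idea; the function name and sample data are invented, and the key is built the same way as the fn:concat expression above.

```python
import io


def pre_scan_key(stream) -> str:
    """Peek at the next 13 chars, build the dispatch key, consume nothing."""
    start = stream.tell()
    head = stream.read(13)
    stream.seek(start)      # reset, so the record is parsed again from char 1
    code = head[4]          # char 5
    subcode1 = head[5]      # char 6
    subcode2 = head[12]     # char 13
    return code + (subcode1 if code == "P" else subcode2)


# Made-up sample records, padded to 132 characters.
p_line = io.StringIO("abcdPAxxxxxxZ" + "r" * 119)
d_line = io.StringIO("abcdDQxxxxxxB" + "r" * 119)

p_key = pre_scan_key(p_line)
d_key = pre_scan_key(d_line)
```

After the call the stream is still positioned at the start of the record, so the chosen branch can parse all 132 characters itself with no substring workarounds.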

I can approximate this by writing a dfdlx:layerTransform DFDL extension for
Daffodil. I've been wanting to write that for a while so as to experiment
with this.


On Wed, May 17, 2023 at 2:13 PM Roger L Costello <co...@mitre.org> wrote:

> Hi Mike,
>
>
>
>    - How do you know if it is character 6 or character 13 that has the
>    subsection code? I assume that depends on the character 5 section code?
>
>
>
> Correct. If the section code = ‘P’ then the subsection code is in position
> 6. If the section code = ‘R’ then the subsection code is in position 13.
> Like that.
>
>
>
>    - What is in characters 1-4 and 6-12 ? Different for every record type?
>
>
>
> Correct. Different for every record type.
>
>
>
>    - There is a pure-DFDL answer
>
>
>
> Yes, that’s what I want!
>
>
>
> *From:* Mike Beckerle <mb...@apache.org>
> *Sent:* Wednesday, May 17, 2023 2:03 PM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: The performance of Daffodil at the command line is
> horrible
>
>
>
> How do you know if it is character 6 or character 13 that has the
> subsection code? I assume that depends on the character 5 section code?
>
>
>
> What is in characters 1-4 and 6-12 ? Different for every record type?
>
>
>
> There is a pure-DFDL answer to this which I don't have enough info yet to
> explain, and there is a Daffodil extension, the dfdlx:lookAhead() function.
> The latter is obvious how to use. You look ahead at characters 5, 6, and
> 13, then convert your choice into a 'choice-by-dispatch' which is constant
> time, not O(m) time.
>
>
>
>
> https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+DFDLX+lookAhead
>
>
>
> This stuff comes up often enough that I'm thinking about a layer to let
> you easily examine a part of the data stream twice - once to learn from it,
> a second time to actually parse it. In your case you want to examine bytes
> 1 to 13 twice. Once to learn the section code and subsection code, a second
> time when actually parsing the message.
>
>
>
> On Wed, May 17, 2023 at 9:42 AM Roger L Costello <co...@mitre.org>
> wrote:
>
> Hi Mike,
>
>
>
>    - how does the format determine which record type, A, B, C, .... is
>    the one in the data?
>
>
>
> The input consists of lines. Each line is exactly 132 characters.
>
>
>
> The type of a line is determined by a 1-character section code plus a
> 1-character subsection code. The section code is always located at
> character 5. The subsection code is always located either at character 6 or
> at character 13. Given that, how would I modify my DFDL schema to improve
> its performance?
>
>
>
> *From:* Mike Beckerle <mb...@apache.org>
> *Sent:* Wednesday, May 17, 2023 9:13 AM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: The performance of Daffodil at the command line is
> horrible
>
>
>
>
>
> The choice is certainly the likely suspect.
>
>
>
> What you have here is an O(n * m) algorithm where n is how many records
> and m is the number of record types.
>
>
>
> So, how does the format determine which record type, A, B, C, .... is the
> one in the data?
>
>
>
> Most formats will have one or a small handful of different criteria used,
> based on common initial parts of the data stream.
>
>
>
> The secret is to capture those in exactly one place in the schema and
> expose it before the choice, so that the choice can exploit that common
> structure.
>
>
>
>
>
>
>
> On Wed, May 17, 2023 at 8:35 AM Roger L Costello <co...@mitre.org>
> wrote:
>
> The input file is 375 MB
> The XML file that DFDL parsing generates is 4.67 GB
>
> Time required for Daffodil to parse the input and generate the XML file is
> 16 minutes, 24 seconds.
>
> Ugh!
>
> That is too long. My customers will laugh at me if I suggest they use a
> tool that takes 16 minutes to parse their data.
>
> Below is the skeletal structure of my DFDL schema. I am pretty sure the
> "choice" is the cause of the slowness. I don't see an alternative to the
> choice; each record of the input could be one of the choices (i.e., the
> input records aren't in any order). Any suggestions for improving the
> performance?
>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>     xmlns:fn="http://www.w3.org/2005/xpath-functions"
>     xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
>
>     <xs:annotation>
>         <xs:appinfo source="http://www.ogf.org/dfdl/">
>             <dfdl:format
>                 alignment="1"
>                 alignmentUnits="bytes"
>                 choiceLengthKind="implicit"
>                 emptyValueDelimiterPolicy="none"
>                 encoding="ASCII"
>                 encodingErrorPolicy="replace"
>                 escapeSchemeRef=""
>                 fillByte="%SP;"
>                 floating="no"
>                 ignoreCase="yes"
>                 initiatedContent="no"
>                 initiator=""
>                 leadingSkip="0"
>                 lengthKind="delimited"
>                 lengthUnits="characters"
>                 nilValueDelimiterPolicy="none"
>                 occursCountKind="implicit"
>                 outputNewLine="%CR;%LF;"
>                 representation="text"
>                 separator=""
>                 separatorSuppressionPolicy="anyEmpty"
>                 sequenceKind="ordered"
>                 textBidi="no"
>                 textPadKind="none"
>                 textTrimKind="none"
>                 trailingSkip="0"
>                 truncateSpecifiedLengthString="no"
>                 terminator=""
>                 textNumberRep="standard"
>                 textStandardBase="10"
>                 textStandardZeroRep="0"
>                 textNumberRounding="pattern"
>                 textStandardExponentRep="E"
>                 textNumberCheckPolicy="strict"
>             />
>         </xs:appinfo>
>     </xs:annotation>
>
>     <xs:element name="Test">
>         <xs:complexType>
>             <xs:sequence dfdl:separator="%NL;"
> dfdl:separatorPosition="infix">
>                 <xs:element name="record" maxOccurs="unbounded" >
>                     <xs:complexType>
>                         <xs:choice>
>                             <xs:element ref="A" />
>
>                             <xs:element ref="B" />
>
>                             <xs:element ref="C" />
>
>                             <xs:element ref="D" />
>
>                             <!-- A hundred more of these element ref's -->
>                         </xs:choice>
>                     </xs:complexType>
>                 </xs:element>
>             </xs:sequence>
>         </xs:complexType>
>     </xs:element>
>
>

Re: The performance of Daffodil at the command line is horrible

Posted by Roger L Costello <co...@mitre.org>.

Hi Mike,


  *   How do you know if it is character 6 or character 13 that has the subsection code? I assume that depends on the character 5 section code?

Correct. If the section code = ‘P’ then the subsection code is in position 6. If the section code = ‘R’ then the subsection code is in position 13. Like that.


  *   What is in characters 1-4 and 6-12 ? Different for every record type?

Correct. Different for every record type.


  *   There is a pure-DFDL answer

Yes, that’s what I want!

From: Mike Beckerle <mb...@apache.org>
Sent: Wednesday, May 17, 2023 2:03 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: The performance of Daffodil at the command line is horrible

How do you know if it is character 6 or character 13 that has the subsection code? I assume that depends on the character 5 section code?

What is in characters 1-4 and 6-12 ? Different for every record type?

There is a pure-DFDL answer to this which I don't have enough info yet to explain, and there is a Daffodil extension, the dfdlx:lookAhead() function. The latter is obvious how to use. You look ahead at characters 5, 6, and 13, then convert your choice into a 'choice-by-dispatch' which is constant time, not O(m) time.

https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+DFDLX+lookAhead

This stuff comes up often enough that I'm thinking about a layer to let you easily examine a part of the data stream twice - once to learn from it, a second time to actually parse it. In your case you want to examine bytes 1 to 13 twice. Once to learn the section code and subsection code, a second time when actually parsing the message.






On Wed, May 17, 2023 at 9:42 AM Roger L Costello <co...@mitre.org> wrote:
Hi Mike,


  *   how does the format determine which record type, A, B, C, .... is the one in the data?

The input consists of lines. Each line is exactly 132 characters.

The type of a line is determined by a 1-character section code plus a 1-character subsection code. The section code is always located at character 5. The subsection code is always located either at character 6 or at character 13. Given that, how would I modify my DFDL schema to improve its performance?

From: Mike Beckerle <mb...@apache.org>
Sent: Wednesday, May 17, 2023 9:13 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: The performance of Daffodil at the command line is horrible


The choice is certainly the likely suspect.

What you have here is an O(n * m) algorithm where n is how many records and m is the number of record types.

So, how does the format determine which record type, A, B, C, .... is the one in the data?

Most formats will have one or a small handful of different criteria used, based on common initial parts of the data stream.

The secret is to capture those in exactly one place in the schema and expose it before the choice, so that the choice can exploit that common structure.



On Wed, May 17, 2023 at 8:35 AM Roger L Costello <co...@mitre.org> wrote:
The input file is 375 MB
The XML file that DFDL parsing generates is 4.67 GB

Time required for Daffodil to parse the input and generate the XML file is 16 minutes, 24 seconds.

Ugh!

That is too long. My customers will laugh at me if I suggest they use a tool that takes 16 minutes to parse their data.

Below is the skeletal structure of my DFDL schema. I am pretty sure the "choice" is the cause of the slowness. I don't see an alternative to the choice; each record of the input could be one of the choices (i.e., the input records aren't in any order). Any suggestions for improving the performance?

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:format
                alignment="1"
                alignmentUnits="bytes"
                choiceLengthKind="implicit"
                emptyValueDelimiterPolicy="none"
                encoding="ASCII"
                encodingErrorPolicy="replace"
                escapeSchemeRef=""
                fillByte="%SP;"
                floating="no"
                ignoreCase="yes"
                initiatedContent="no"
                initiator=""
                leadingSkip="0"
                lengthKind="delimited"
                lengthUnits="characters"
                nilValueDelimiterPolicy="none"
                occursCountKind="implicit"
                outputNewLine="%CR;%LF;"
                representation="text"
                separator=""
                separatorSuppressionPolicy="anyEmpty"
                sequenceKind="ordered"
                textBidi="no"
                textPadKind="none"
                textTrimKind="none"
                trailingSkip="0"
                truncateSpecifiedLengthString="no"
                terminator=""
                textNumberRep="standard"
                textStandardBase="10"
                textStandardZeroRep="0"
                textNumberRounding="pattern"
                textStandardExponentRep="E"
                textNumberCheckPolicy="strict"
            />
        </xs:appinfo>
    </xs:annotation>

    <xs:element name="Test">
        <xs:complexType>
            <xs:sequence dfdl:separator="%NL;" dfdl:separatorPosition="infix">
                <xs:element name="record" maxOccurs="unbounded" >
                    <xs:complexType>
                        <xs:choice>
                            <xs:element ref="A" />
                            <xs:element ref="B" />
                            <xs:element ref="C" />
                            <xs:element ref="D" />
                            <!-- A hundred more of these element ref's -->
                        </xs:choice>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

Re: The performance of Daffodil at the command line is horrible

Posted by Roger L Costello <co...@mitre.org>.

Hi Mike,


  *   how does the format determine which record type, A, B, C, .... is the one in the data?

The input consists of lines. Each line is exactly 132 characters.

The type of a line is determined by a 1-character section code plus a 1-character subsection code. The section code is always located at character 5. The subsection code is always located either at character 6 or at character 13. Given that, how would I modify my DFDL schema to improve its performance?

From: Mike Beckerle <mb...@apache.org>
Sent: Wednesday, May 17, 2023 9:13 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: The performance of Daffodil at the command line is horrible


The choice is certainly the likely suspect.

What you have here is an O(n * m) algorithm where n is how many records and m is the number of record types.

So, how does the format determine which record type, A, B, C, .... is the one in the data?

Most formats will have one or a small handful of different criteria used, based on common initial parts of the data stream.

The secret is to capture those in exactly one place in the schema and expose it before the choice, so that the choice can exploit that common structure.



On Wed, May 17, 2023 at 8:35 AM Roger L Costello <co...@mitre.org> wrote:
The input file is 375 MB
The XML file that DFDL parsing generates is 4.67 GB

Time required for Daffodil to parse the input and generate the XML file is 16 minutes, 24 seconds.

Ugh!

That is too long. My customers will laugh at me if I suggest they use a tool that takes 16 minutes to parse their data.

Below is the skeletal structure of my DFDL schema. I am pretty sure the "choice" is the cause of the slowness. I don't see an alternative to the choice; each record of the input could be one of the choices (i.e., the input records aren't in any order). Any suggestions for improving the performance?

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:format
                alignment="1"
                alignmentUnits="bytes"
                choiceLengthKind="implicit"
                emptyValueDelimiterPolicy="none"
                encoding="ASCII"
                encodingErrorPolicy="replace"
                escapeSchemeRef=""
                fillByte="%SP;"
                floating="no"
                ignoreCase="yes"
                initiatedContent="no"
                initiator=""
                leadingSkip="0"
                lengthKind="delimited"
                lengthUnits="characters"
                nilValueDelimiterPolicy="none"
                occursCountKind="implicit"
                outputNewLine="%CR;%LF;"
                representation="text"
                separator=""
                separatorSuppressionPolicy="anyEmpty"
                sequenceKind="ordered"
                textBidi="no"
                textPadKind="none"
                textTrimKind="none"
                trailingSkip="0"
                truncateSpecifiedLengthString="no"
                terminator=""
                textNumberRep="standard"
                textStandardBase="10"
                textStandardZeroRep="0"
                textNumberRounding="pattern"
                textStandardExponentRep="E"
                textNumberCheckPolicy="strict"
            />
        </xs:appinfo>
    </xs:annotation>

    <xs:element name="Test">
        <xs:complexType>
            <xs:sequence dfdl:separator="%NL;" dfdl:separatorPosition="infix">
                <xs:element name="record" maxOccurs="unbounded" >
                    <xs:complexType>
                        <xs:choice>
                            <xs:element ref="A" />
                            <xs:element ref="B" />
                            <xs:element ref="C" />
                            <xs:element ref="D" />
                            <!-- A hundred more of these element ref's -->
                        </xs:choice>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>