Posted to dev@daffodil.apache.org by Christofer Dutz <ch...@c-ware.de> on 2019/01/09 13:56:28 UTC

Using DFDL to generate model, parser and generator?

Hi all,

I am currently looking for a solution to the following question:

In the Apache PLC4X (incubating) project we are implementing a lot of different industry protocols.
Each protocol sends packets following a particular format. For each of these we currently implement an internal model, serializers and parsers.
Until now this has all been pure Java, but we are now starting to work on C++ and would like to add even more languages.

As we don’t want to keep all of these implementations in sync manually, my idea was to describe the data format in some declarative form and have the parsers, serializers and the model generated from it.
The implementation then only has to take care of the plumbing and the state machine of the protocol.

In Montreal I attended a great talk on DFDL and Daffodil, so I think DFDL in general would be a great fit.
Unfortunately, for performance reasons, we don’t want to parse any data format into an XML or DOM representation.

My ideal workflow would look like this:

  1.  For every protocol I define the DFDL documents describing that protocol’s different types of messages
  2.  I define multiple protocol implementation modules (one for each language)
  3.  I use a maven plugin in each of these to generate the code for that particular language from those central DFDL definitions
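In pom.xml terms, step 3 might look something like this (the plugin coordinates and parameters are purely made up, since nothing like this exists yet):

```xml
<!-- Hypothetical: no such plugin exists yet; coordinates and parameters are invented. -->
<plugin>
  <groupId>org.apache.daffodil</groupId>
  <artifactId>daffodil-codegen-maven-plugin</artifactId>
  <version>0.1.0-SNAPSHOT</version>
  <executions>
    <execution>
      <goals>
        <goal>generate</goal>
      </goals>
      <configuration>
        <!-- central DFDL definitions shared by all language modules -->
        <schemaDirectory>src/main/dfdl</schemaDirectory>
        <!-- each language module sets its own target -->
        <targetLanguage>java</targetLanguage>
      </configuration>
    </execution>
  </executions>
</plugin>
```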

Is this possible?
Is it planned to support this in the future?
What other options do you see for this sort of problem?

I am absolutely willing to get my hands dirty and help implement this, if you say: “Yes we want that too but haven’t managed to do that yet”.

Chris

Re: Using DFDL to generate model, parser and generator?

Posted by Christofer Dutz <ch...@c-ware.de>.
Hi Mike,

So I converted one of my protocols into an XML schema that uses the DFDL namespace (just to get started).
Unfortunately I'm having a little problem with how to define type inheritance: I have, for example, parameter elements that all start with a one-byte type code followed by a one-byte length.
The rest is completely different, depending on the type of parameter.
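What I'm trying to express would presumably end up as a choice dispatched on the type code, something like the following (just a sketch; element names, types and dfdl:* property values are my guesses, not the actual protocol):

```xml
<!-- Sketch only: instead of subtyping, parse the common header fields,
     then dispatch on the already-parsed type code. -->
<xs:element name="parameter">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="typeCode" type="xs:unsignedByte"
                  dfdl:lengthKind="explicit" dfdl:length="1" dfdl:lengthUnits="bytes"/>
      <xs:element name="length" type="xs:unsignedByte"
                  dfdl:lengthKind="explicit" dfdl:length="1" dfdl:lengthUnits="bytes"/>
      <xs:choice dfdl:choiceDispatchKey="{xs:string(./typeCode)}">
        <xs:element name="sizeParameter"   dfdl:choiceBranchKey="1" type="SizeParameterType"/>
        <xs:element name="timingParameter" dfdl:choiceBranchKey="2" type="TimingParameterType"/>
      </xs:choice>
    </xs:sequence>
  </xs:complexType>
</xs:element>
```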

Seems something like this isn't DFDL:






[1] https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java

Re: Using DFDL to generate model, parser and generator?

Posted by "Beckerle, Mike" <mb...@tresys.com>.
This makes sense to me architecturally, as the infrastructure through which people would use this.


Compiling a DFDL schema into any sort of compiled form, whether that is generated code or just a saved runtime data structure (like we have now), is exactly what people want as a Maven/sbt build step, so creating a plugin that does this is very sensible.


Right now compiling is slow (unnecessarily so; I hope we speed it up and reduce its memory footprint soon), so a build step that is re-run only if the schema actually changed is very useful to save time waiting for the Daffodil compiler.


I suspect that generating code from the Daffodil parser/unparser data structures will push the boundaries of what anyone would call a "template". This is going to be quite a sophisticated recursive-descent walk, accumulating a variety of things and eventually emitting the code. I think it is totally worth trying, though.


Re: Using DFDL to generate model, parser and generator?

Posted by Christofer Dutz <ch...@c-ware.de>.
Hi Mike,

Well I am currently experimenting with creating a DFDL schema for one of the many protocol layers we have. 

I would propose the following (please correct me if I'm wrong):
- We create DFDL schemas.
- We use Daffodil to process them (assuming that, in order to process DFDL schemas, there has to be some sort of model representation).
- We add a Maven plugin that uses the parsed schema model and generates code via some templating language (FreeMarker and Velocity are both Apache projects, so it should be one of those).
- In a project you define templates for the current use case (a general-purpose runtime would be sub-optimal for our case; we would probably use Netty utils for parsing/serializing).

Perhaps, based on these PLC4X templates, it would make sense to build other sets of templates as part of the Daffodil project.
Daffodil could have multiple sets of templates for different languages and frameworks. Eventually a template module could come with a runtime module to be used by the generated code.

By default the Maven plugin would look for local templates; if you provide a template artifact, the plugin would use those templates instead.

In the end I would probably build the Maven plugin in a way that makes it easy to also run it from the command line, or to build plugins for SBT, Gradle, Ant and so on ...

What do you think?

Chris




Re: Using DFDL to generate model, parser and generator?

Posted by "Beckerle, Mike" <mb...@tresys.com>.
Christofer,


Yes, what you suggest is possible; it is what many people want and has been talked about here and there, but I don't know of anyone else doing exactly this right now.


Effectively, what you are describing is a code-generator backend for Daffodil. I think this is a great idea. I personally want one that generates VHDL, Verilog or another hardware-synthesis language, so you can go directly to an FPGA for data parsing at hardware speed.


Anyway, such a generator would likely extend the existing parser/unparser primitives so that, in addition to their parse() and unparse() methods, they would have generateCode() methods that emit the equivalent code and recursively invoke generateCode() on their sub-objects, incorporating the results.
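In Java-ish terms, the shape would be something like this (all class and method names are hypothetical, not Daffodil's actual internals):

```java
// Sketch of the generateCode() idea: parser primitives gain a method that
// emits equivalent target-language code and recurses into sub-objects.
// All class and method names are hypothetical, not Daffodil's actual API.
import java.util.Arrays;
import java.util.List;

abstract class Prim {
    // parse() and unparse() are omitted; only the code-gen walk is sketched.
    abstract void generateCode(StringBuilder out);
}

class ReadByte extends Prim {
    private final String fieldName;
    ReadByte(String fieldName) { this.fieldName = fieldName; }
    @Override void generateCode(StringBuilder out) {
        out.append("int ").append(fieldName).append(" = in.readByte();\n");
    }
}

class SequencePrim extends Prim {
    private final List<Prim> children;
    SequencePrim(Prim... children) { this.children = Arrays.asList(children); }
    @Override void generateCode(StringBuilder out) {
        for (Prim child : children) {
            child.generateCode(out); // recursion mirrors the runtime parse walk
        }
    }
}

public class CodeGenSketch {
    public static void main(String[] args) {
        Prim parameter = new SequencePrim(new ReadByte("typeCode"), new ReadByte("length"));
        StringBuilder out = new StringBuilder();
        parameter.generateCode(out);
        System.out.print(out);
        // prints:
        // int typeCode = in.readByte();
        // int length = in.readByte();
    }
}
```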


I would suggest that the existing Daffodil backend, which may well not be fast enough for your needs, would nevertheless be a very valuable part of your testing strategy: your schemas should work on Daffodil, so you can verify that the behavior of your generated parsers is consistent with it. It may also be helpful for diagnostic purposes, i.e., if data is parsed and determined invalid, perhaps the "kit" you give your users involves parsing such data with regular old Daffodil into XML for tangibility/inspection.


There is, of course, a fair amount of runtime library to be created to go with the generated code. Daffodil has daffodil-lib, daffodil-io, daffodil-runtime1, and daffodil-runtime1-unparser, each of which contains a large volume of runtime code that would need to be replaced with a C/C++ equivalent in a new runtime. I would suggest much of the work is actually here, not in the compilation.


I really hope you undertake this effort. I think a code-gen style backend will be a big value-add for Daffodil. The current backend really hasn't had raw speed as its goal; it has largely been about correctness and getting the DFDL standard fully (or mostly) implemented quickly. Let us know how we can help you get started.


The other thing worth mentioning is that Daffodil's roadmap includes plans for a streaming parser/unparser. This would not build a DOM-like tree structure, but would instead emit events along the lines of a SAX-style parse of the data. Some formats are simply not streamable, and for those there is no way to avoid building up a tree in memory. But many formats are streamable, and people really do want the ability to parse files much larger than memory, in finite RAM, as long as the format allows it.
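The event-based idea can be sketched like this (the handler interface is hypothetical, not Daffodil's actual API; the toy parser just illustrates that only constant state, not a tree, is held in memory):

```java
// Sketch of the SAX-style streaming idea: the parser pushes events to a
// handler instead of building a tree, so memory use stays bounded.
// The handler interface is hypothetical, not Daffodil's actual API.
import java.util.ArrayList;
import java.util.List;

interface InfosetHandler {
    void startElement(String name);
    void simpleValue(String name, Object value);
    void endElement(String name);
}

class RecordingHandler implements InfosetHandler {
    final List<String> events = new ArrayList<>();
    public void startElement(String name) { events.add("start:" + name); }
    public void simpleValue(String name, Object value) { events.add(name + "=" + value); }
    public void endElement(String name) { events.add("end:" + name); }
}

public class StreamingSketch {
    // A streaming parser for a type-code/length parameter would emit events
    // as bytes arrive, rather than accumulating a whole infoset tree.
    static void parseParameter(byte[] data, InfosetHandler handler) {
        handler.startElement("parameter");
        handler.simpleValue("typeCode", (int) data[0]);
        handler.simpleValue("length", (int) data[1]);
        handler.endElement("parameter");
    }

    public static void main(String[] args) {
        RecordingHandler h = new RecordingHandler();
        parseParameter(new byte[] {1, 4}, h);
        System.out.println(h.events);
        // prints: [start:parameter, typeCode=1, length=4, end:parameter]
    }
}
```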


-mike beckerle

Tresys Technology
