Posted to dev@daffodil.apache.org by Steve Lawrence <sl...@apache.org> on 2018/04/03 15:49:36 UTC

Re: large PCAP files and changes to message-streaming API for onPath() support vs. SAX behavior

It sounds to me like you're suggesting something like the following:

  val headerDP = pf.onPath("/PCAPHeader")
  val packetDP = pf.onPath("/Packet")

  val headerPR = headerDP.parse(input)
  while (true) {
    val packetPR = packetDP.parse(input, headerPR)
    ...
  }

So rather than supporting SAX-style events and being able to throw away
old data, the user knows enough about the data to split it up into
multiple data processors, and can use the parse result (or something
like it) from the first parse to access state information?

I think we could support something like the above using separate schemas
and variables, without the need to support onPath.

  val headerDP = compiler.compile("header.dfdl.xsd").onPath("/")
  val packetDP = compiler.compile("packet.dfdl.xsd").onPath("/")

  val headerPR = headerDP.parse(input)
  val byteOrder = headerPR.infoset / "byteOrder"

  packetDP.setVariable("byteOrder", byteOrder)
  while (true) {
    val packetPR = packetDP.parse(input)
  }
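
To make that concrete, here is a rough, untested sketch against the Scala
API. It assumes packet.dfdl.xsd declares byteOrder as an external variable
(dfdl:defineVariable with external="true") and references it where the
byte order is needed, and it assumes the streaming-style parse call that
takes a shared InputSourceDataInputStream plus a binding method along the
lines of withExternalVariables:

  import java.io.{File, FileInputStream}
  import org.apache.daffodil.sapi.Daffodil
  import org.apache.daffodil.sapi.io.InputSourceDataInputStream
  import org.apache.daffodil.sapi.infoset.ScalaXMLInfosetOutputter

  val c = Daffodil.compiler()

  // Two independent schemas: one whose root is the global header,
  // one whose root is a single packet record.
  val headerDP = c.compileFile(new File("header.dfdl.xsd")).onPath("/")
  val packetPF = c.compileFile(new File("packet.dfdl.xsd"))

  // One data input stream, shared by every parse call.
  val dis = new InputSourceDataInputStream(new FileInputStream("big.pcap"))

  val headerOut = new ScalaXMLInfosetOutputter()
  val headerPR = headerDP.parse(dis, headerOut)

  // Pull byteOrder out of the header infoset and bind it as an
  // external variable on the packet processor.
  val byteOrder = (headerOut.getResult() \\ "byteOrder").text
  val packetDP = packetPF.onPath("/")
    .withExternalVariables(Map("byteOrder" -> byteOrder))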

The only new thing needed to support the above is the ability to pass
the same input to multiple parse calls and continue where the previous
parse left off, which was discussed a while ago in "streaming API
design review" [1].
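
If that continuation behavior existed, the packet loop in the sketch
above would reduce to something like the following (assuming the stream
can report whether data remains, via a hasData()-style check, and that
the outputter can be reset between calls; handlePacket is a hypothetical
per-packet callback):

  val packetOut = new ScalaXMLInfosetOutputter()
  var failed = false
  while (!failed && dis.hasData()) {
    packetOut.reset()
    // Each call resumes at the bit position where the previous one stopped.
    val pr = packetDP.parse(dis, packetOut)
    if (pr.isError()) {
      pr.getDiagnostics.foreach(d => System.err.println(d.getMessage()))
      failed = true
    } else {
      handlePacket(packetOut.getResult())  // hypothetical per-packet callback
    }
  }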

If this is all correct and meets the use cases you describe, I would say
effort is better spent on supporting the same input going to multiple
parse calls, and on the SAX behavior for cases where a split like the
above isn't so easy or clear.

- Steve

[1]
https://lists.apache.org/thread.html/e49ecbacef1e9f89817e3fc7438b73733dc76b629ccf003f8f529d8e@%3Cdev.daffodil.apache.org%3E







On 03/29/2018 12:03 PM, Mike Beckerle wrote:
> I was recently asked about Daffodil and processing gigantic (multi-gigabyte) PCAP files.
> 
> 
> We have two features planned for 2.2.0 which help with memory footprint issues. BLOBs, and the message-streaming feature.
> 
> 
> At first glance, neither will help with giant PCAP files. A PCAP file is large, but not because of a binary BLOB inside it the way an image or video file would be; and it is a single giant tree, not a stream of messages all with the same root element.
> 
> 
> This use case was the justification for requesting a SAX-style true event-driven behavior for Daffodil.
> 
> 
> Long term that's great, but SAX is complex to implement given DFDL and points-of-uncertainty/backtracking in the parser, so I wanted to explore whether with some small API changes we could dodge this SAX-bullet at least for PCAP.
> 
> 
> So for PCAP, the file consists of a global header and then a bunch of packets. The packets are exactly like a message stream; if we could just skip past the header while keeping the state we need from it, like the byte order, then a message-streaming pull-type parser would be ideal.
> 
> 
> (We would also need the symmetric unparser behavior)
> 
> 
> Our ProcessorFactory method pf.onPath("/Packets") would in theory be usable with the message-streaming API to sequence through just the packets, with each parse call returning the Infoset for one packet. The path given to pf.onPath is supposed to be a path to an array element, relative to the root element that the PF was compiled for.
> 
> 
> What is involved in implementing a pf.onPath(...) that actually steps downward into the data stream to skip past some material before beginning the iteration?
> 
> 
> For unparsing, things aren't quite so symmetric. We would need someone to provide the infoset events for the part of the data we're skipping past with the onPath(...).
> 
> 
> It would probably be sufficient to implement a very simple subset of path expressions. E.g., only paths to first-level arrays, not arrays within arrays, not anything inside a nested point of uncertainty, etc.
> 
> 
> If this is easy, it may allow us to postpone the SAX stuff longer. If it is complex, then I would guess it isn't worth it and we should just go for true event-style parse and unparse.
> 
> 
> Thoughts?
> 
> 


Re: large PCAP files and changes to message-streaming API for onPath() support vs. SAX behavior

Posted by Mike Beckerle <mb...@tresys.com>.
Your analysis looks right to me. The key idea is that the same input is read by multiple parsers in sequence. But it isn't the java.io/java.nio input that gets handed off, as the handoff from one parser to another could be on any bit boundary.


So it's really the DataInputStream that has to be handed off. It could perhaps be the entire PState, but then the current infoset node would have to make sense, etc., which it may not.
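
For PCAP itself the handoff point happens to be byte-aligned: the global header is a fixed 24 bytes, and the only state the packet parses need from it is the byte order implied by the magic number. Roughly, in plain java.nio terms (outside Daffodil, just to show what has to survive the handoff):

  import java.io.FileInputStream
  import java.nio.{ByteBuffer, ByteOrder}

  val hdr = new Array[Byte](24)          // pcap global header is 24 bytes
  val in = new FileInputStream("big.pcap")
  in.read(hdr)

  // ByteBuffer reads big-endian by default; if the magic number does not
  // read back as 0xa1b2c3d4, the file was written little-endian.
  val buf = ByteBuffer.wrap(hdr)
  val byteOrder =
    if (buf.getInt(0) == 0xa1b2c3d4) ByteOrder.BIG_ENDIAN
    else ByteOrder.LITTLE_ENDIAN

  // Every following 16-byte packet-record header (ts_sec, ts_usec,
  // incl_len, orig_len) must be read with this same byte order.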


This same idea also would have to work for unparsing.


