Posted to dev@daffodil.apache.org by Patrick GRANDJEAN <pa...@yahoo.fr.INVALID> on 2018/12/11 22:01:35 UTC

DFDL & Input Formats

Hi !
My name is Patrick, and I recently attended a Spark meetup in Boston where Mike Beckerle presented Apache Daffodil (incubating). I work for a company that deals with many data formats, both new (JSON, XML, YAML, Protobuf, etc.) and old (EDIFACT, IATA formats, etc.). More recently, we have started to process files in Hadoop, and we have developed "input formats" for each data format. Basically, an input format tells Hadoop how a file can be split into smaller parts to be processed.

To give an example, let's consider a huge XML having the following structure:
<root schemaVersion="1.2.3">
  <transaction>...</transaction>
  <transaction>...</transaction>
  ...
  <transaction>...</transaction>
</root>

Each transaction needs to be processed individually. An input format can split such an XML file into a list of valid XML documents, each containing a single <transaction>:
<root schemaVersion="1.2.3">
  <transaction>...</transaction>
</root>
<root schemaVersion="1.2.3">
  <transaction>...</transaction>
</root>
...
It is not necessary to completely parse the XML at that point, only to split it into smaller pieces. Therefore, parsing the contents of each <transaction> can be bypassed, except to detect the closing tag </transaction>.
I was wondering whether DFDL has such a concept of a splittable file. If not, would it be interesting to add one? The main advantage I see is this: if DFDL can describe a data format and how to split it, then one could use a generic Hadoop input format to process files using DFDL. In other words, in addition to parsers and unparsers, users would get Hadoop input formats for (almost?) free.
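To make the splitting idea concrete, here is a rough sketch of the kind of record splitter an input format could use (illustrative Scala only; the names are made up, and a real Hadoop InputFormat would stream over the byte ranges of a file split rather than hold the whole document in memory):

// Sketch only: scan for the closing tag and re-wrap each <transaction>
// in its own <root> so that every fragment is a valid XML document.
object SplitSketch {
  def splitTransactions(xml: String): Seq[String] = {
    val rootStartTag = xml.substring(0, xml.indexOf('>') + 1) // e.g. <root schemaVersion="1.2.3">
    val startTag = "<transaction"
    val endTag = "</transaction>"
    val fragments = scala.collection.mutable.ArrayBuffer[String]()
    var start = xml.indexOf(startTag)
    while (start >= 0) {
      val end = xml.indexOf(endTag, start)
      if (end < 0) return fragments.toSeq                     // malformed tail; stop
      fragments += rootStartTag + xml.substring(start, end + endTag.length) + "</root>"
      start = xml.indexOf(startTag, end + endTag.length)
    }
    fragments.toSeq
  }

  def main(args: Array[String]): Unit = {
    val doc = """<root schemaVersion="1.2.3"><transaction>a</transaction><transaction>b</transaction></root>"""
    splitTransactions(doc).foreach(println)
  }
}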
Please let me know if this idea makes sense in the context of Apache Daffodil. I would love to discuss this further.

Kind Regards,
Patrick.

Re: DFDL & Input Formats

Posted by Mike Beckerle <mb...@tresys.com>.
Patrick,


So, the ability to split quickly depends on the data format and on the behavior of the data.


In the XML example you gave, you are depending on a fast scan for the terminator "</transaction>", and on there being no way that scan gets tripped up by quoting problems, such as that string appearing within the data itself. That is usually a pretty safe assumption for XML.
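To spell out why that assumption holds: in well-formed XML, a literal '<' inside character data or attribute values must be escaped as &lt;, so the raw byte sequence of "</transaction>" can only occur at a real end tag. A tiny illustration (just something to paste into a Scala REPL, not code from the example):

// "</transaction>" inside text must be escaped, so a raw scan for the
// terminator only ever lands on the real end tag.
val record = "<transaction><note>tricky text: &lt;/transaction&gt;</note></transaction>"
val terminator = "</transaction>"
assert(record.indexOf(terminator) == record.length - terminator.length)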


Consider https://github.com/DFDLSchemas/GeoNames


GeoNames is a really good candidate for exactly the sort of fast split-up you are talking about. It's a 2+ gigabyte file compressed, and the data stream can be rapidly split up.


So I modified the https://github.com/OpenDFDL/daffodil-spark example. It now processes GeoNames data using Spark, and it should show you how to "fast parse" data. The GeoNames data is a file of "quasi-XML" that needs to be massaged back into real XML form, and it's really big. The "test" for GeoNames here reads a compressed GeoNames data file (a small sample is included) and writes out a compressed Spark RDD as files.

I think it illustrates that you have to drive the parser sequentially to split the data, but after that all subsequent processing (in this case, assembling the fragments of quasi-XML into an actual piece of well-behaved XML) is Spark-parallel work.
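The shape of the pattern is roughly the following. This is only a sketch, not the actual code in the daffodil-spark repo; the splitter and the per-record transform are placeholders for what the example really does with Daffodil:

import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.sql.SparkSession

// Sketch of the pattern: the driver walks the input sequentially to carve it
// into records, then everything done per record is ordinary parallel RDD work.
object SequentialSplitThenParallel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-then-parallel-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Sequential part: only the splitting needs a single pass over the input.
    // splitIntoFragments is a placeholder; in the real example this is where
    // Daffodil separates the records off the stream.
    val fragments: Seq[String] = splitIntoFragments(args(0))

    // Parallel part: per-fragment work (here a stand-in for turning quasi-XML
    // fragments into well-formed XML) runs across the cluster.
    val realXml = sc.parallelize(fragments)
      .map(fragment => "<entry>" + fragment + "</entry>")

    // Write the result compressed, like the test in the example does.
    realXml.saveAsTextFile(args(1), classOf[GzipCodec])

    spark.stop()
  }

  // Placeholder splitter: one record per line, just to keep the sketch runnable.
  def splitIntoFragments(path: String): Seq[String] =
    scala.io.Source.fromFile(path).getLines().toSeq
}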

Give it a look-see. I'm not much of a Spark expert, but I *think* this is going to create parallel data as fast as Daffodil can separate it off.

-mike beckerle



Re: DFDL & Input Formats

Posted by Mike Beckerle <mb...@tresys.com>.
Thanks for raising this, Patrick.


I'm CCing this to users@daffodil.apache.org and will respond there first about ways to use Daffodil to do this.


If we have to discuss adding features/APIs to Daffodil (which we might), then that discussion would make sense here on the dev list.


There's no notion in DFDL of "splittable" that exactly matches the concept you want, but there are techniques to discuss.



