Posted to user@drill.apache.org by Charles Givre <cg...@gmail.com> on 2019/11/07 17:58:43 UTC

Re: Use cases for DFDL

Hi Steve, 
Thanks for responding... Here's how Drill reads a file:

Drill uses what are called "format plugins," which read the file in question and map its fields to column vectors.  Note:  Drill supports nested data structures, so a column can contain a MAP or a LIST. 

The basic steps are:
1.  Open the input stream and read the file.
2.  If the schema is known in advance, it is advantageous to define it up front with a SchemaBuilder object and create writers for each column.  Since we'd be using DFDL here, we do know the schema, so we could create it BEFORE the data actually gets read.  If the schema is not known in advance (JSON, for instance), Drill can discover it while reading the data by dynamically adding column vectors as data is ingested, but that's not the case here.
3.  Once the schema is defined, Drill reads the file row by row, parses the data, and assigns values to each column vector. 

There are a few more details, but that's the essence.  A rough sketch of steps 2 and 3 follows below.
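
Here's what that sketch might look like, using the SchemaBuilder and RowSetLoader classes from the Row Set framework [1].  The column names are invented, the real BatchReader wiring (scan framework negotiator, batch sizing, etc.) is omitted, and the package names have moved around a bit between Drill releases, so treat this as illustrative rather than working plugin code:

  import org.apache.drill.common.types.TypeProtos.MinorType;
  import org.apache.drill.exec.physical.resultSet.RowSetLoader;
  import org.apache.drill.exec.record.metadata.SchemaBuilder;
  import org.apache.drill.exec.record.metadata.TupleMetadata;

  public class DfdlReaderSketch {

    // Step 2: build the Drill schema up front, before any data is read.
    static TupleMetadata buildSchema() {
      return new SchemaBuilder()
          .addNullable("callsign", MinorType.VARCHAR)
          .addNullable("altitude", MinorType.INT)
          .buildSchema();
    }

    // Step 3: write one record.  The RowSetLoader is handed to the reader
    // by the scan framework, built from the schema above.
    static void writeRow(RowSetLoader rowWriter, String callsign, int altitude) {
      rowWriter.start();
      rowWriter.scalar("callsign").setString(callsign);
      rowWriter.scalar("altitude").setInt(altitude);
      rowWriter.save();
    }
  }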

What would be great is a function that maps a DFDL schema directly to a Drill SchemaBuilder (docs here [1]).  Drill does natively support JSON; however, it would probably be more effective and efficient to have an InfosetOutputter custom-built for Drill.  Ideally, we need some sort of Iterable object so that Drill can map the parsed fields onto that schema, as sketched below.
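
To illustrate the Iterable idea, a naive bridge could just walk the JDOM infoset Daffodil already produces and push each field into the row writer.  This is purely a sketch: it assumes one record per child of the infoset root, all-string fields already present in the schema, and the getResult() accessor used in the Daffodil hello-world example; a real mapping would be driven by the DFDL schema and would handle MAP/LIST columns:

  import org.jdom2.Document;
  import org.jdom2.Element;

  // 'out' is the JDOMInfosetOutputter after a successful parse;
  // 'rowWriter' is the RowSetLoader from the sketch above.
  Document infoset = out.getResult();
  for (Element record : infoset.getRootElement().getChildren()) {
    rowWriter.start();
    for (Element field : record.getChildren()) {
      // Assumes every field is a simple string column already in the schema.
      rowWriter.scalar(field.getName()).setString(field.getTextTrim());
    }
    rowWriter.save();
  }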

If you want to see a relatively simple format plugin, take a look here: [2]. That file is the BatchReader, which is where most of the heavy lifting takes place.  The plugin is for ESRI Shapefiles and has a mix of pre-defined fields, nested fields, and fields that are defined only after reading starts.


[1]: https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md
[2]: https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java


I can start a draft PR on the Drill side over the weekend and will share the link to this list.
Respectfully, 
-- C


> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <st...@gmail.com> wrote:
> 
> I definitely agree. Apache Drill seems like a logical place to add
> Daffodil support. And I'm sure many of us, including myself, would be
> happy to provide some time towards this effort.
> 
> The Daffodil API is actually fairly simple and is usually fairly
> straightforward to integrate--most of the complexity comes from the DFDL
> schemas. There's a good "hello world" available [1] that shows more API
> functionality/errors/etc., but the gist of it is:
> 
> 1) Compile a DFDL schema to a data processor:
> 
>  Compiler c = Daffodil.compiler();
>  ProcessorFactory pf = c.compileFile(file);
>  DataProcessor dp = pf.onPath("/");
> 
> 2) Create an input source for the data
> 
>  InputStream is = ...
>  InputSourceDataInputStream in = new InputSourceDataInputStream(is);
> 
> 3) Create an infoset outputter (we have a handful of different kinds)
> 
>  JDOMInfosetOutputter out = new JDOMInfosetOutputter();
> 
> 4) Use the DataProcessor to parse the input data to the infoset outputter
> 
>  ParseResult pr = dp.parse(in, out);
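> 
> 5) Check the result and pull out the infoset (following the hello-world
> example [1]; exact accessor names may vary by Daffodil version)
> 
>  if (pr.isError()) {
>      pr.getDiagnostics().forEach(d -> System.err.println(d));
>  } else {
>      org.jdom2.Document doc = out.getResult();   // the parsed infoset
>  }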
> 
> So I guess the parts that we would need more Drill understanding is what
> the InfosetOutputter (step 3) needs to look like to better integrate
> into Drill. Is there a standard data structure that Drill expects
> representations of data to look like and Drill does the querying on the
> data structure? And is there some sort of schema that Daffodil would
> need to create to describe what this structure looks like so it could
> query it? Perhaps we'd have a custom Drill InfosetOutputter that creates
> this data structure, unless Drill already supports XML or JSON.
> 
> Or is it completely up to the Storage Plugin (is that the right term) to
> determine how to take a Drill query and find the appropriate data from
> the data store?
> 
> - Steve
> 
> [1]
> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java
> 
> 
> On 11/3/19 9:31 AM, Charles Givre wrote:
>> Hi Julian,
>> It seems like there is a beginning of convergence of the minds here.  I went to 
>> the Apache Roadshow in DC and that was where I learned about DFDL and 
>> immediately thought this was a really interesting possibility.
>> 
>> I'd love to see if we could foster some collaboration between the various 
>> projects on this.  From the Drill side of things, it would make it SO much 
>> easier to get Drill to read (and by extension query) various data types.  I'd be 
>> willing to contribute time from the Drill side, but I definitely will need help 
>> understanding how DFDL works.
>> 
>> --C
>> 
>> 
>> 
>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer <j.feinauer@pragmaticminds.de> wrote:
>>> 
>>> Hi Charles,
>>> this is an interesting idea and in fact we also discussed the same matter for 
>>> Calcite at ApacheCon NA.
>>> But, I agree that it would be really powerful together with a complete Runtime 
>>> like Drill.
>>> Julian
>>> *From:*Charles Givre <cgivre@gmail.com>
>>> *Reply-To:*"users@daffodil.apache.org" <users@daffodil.apache.org>
>>> *Date:*Wednesday, October 30, 2019 at 19:38
>>> *To:*"Costello, Roger L." <costello@mitre.org>
>>> *Cc:*"users@daffodil.apache.org" <users@daffodil.apache.org>
>>> *Subject:*Re: Use cases for DFDL
>>> +1
>>> 
>>> +1
>>> 
>>> 
>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <costello@mitre.org> wrote:
>>>> Excellent! Okay, here’s the use case:
>>>> A Daffodil extension could be created for Apache Drill so that you could 
>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could 
>>>> use ANSI SQL to query the data, join it with other data, do analysis, etc., 
>>>> just as if it came from a database. So, instead of parsing data to XML and 
>>>> then using XPath to pull out data, you could instead parse data to Apache 
>>>> Drill's data representation and then use ANSI SQL to pull out data, and even 
>>>> combine it with other non-Daffodil data types. The advantage for this would 
>>>> be that it would make it very easy to enable Drill to query new data types 
>>>> (IE simply by using a DFDL schema) and it would enable users to easily query 
>>>> this data without having to load it into another system.
>>>> How’s that Charles?
>>>> /Roger
>>>>> *From:*Charles Givre <cgivre@gmail.com>
>>>>> *Sent:*Wednesday, October 30, 2019 2:28 PM
>>>>> *To:*Costello, Roger L. <costello@mitre.org>
>>>>> *Cc:*users@daffodil.apache.org
>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax. It is 
>>>> regular ANSI SQL.  IMHO, this would be a really great collaboration 
>>>> of the two communities.
>>>> --C
>>>> 
>>>> 
>>>> 
>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <costello@mitre.org> wrote:
>>>>> Thanks again Charles. Is the following use case description correct?
>>>>> A Daffodil extension could be created for Apache Drill so that you could 
>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could 
>>>>> use Apache Drill's query-like syntax and rich capabilities to query parts of 
>>>>> that data, join it with other data, do analysis, etc., just as if it came 
>>>>> from a database. So, instead of parsing data to XML and then using XPath to 
>>>>> pull out data, you could instead parse data to Apache Drill's data 
>>>>> representation and then use Drill's rich data-query capabilities to pull out 
>>>>> data, and even combine it with other non-Daffodil data types. The advantage 
>>>>> for this would be that it would make it very easy to enable Drill to query 
>>>>> new data types (IE simply by using a DFDL schema) and it would enable users 
>>>>> to easily query this data without having to load it into another system.
>>>>> Is that correct?
>>>>> /Roger
>>>>> *From:*Charles Givre <cgivre@gmail.com>
>>>>> *Sent:*Wednesday, October 30, 2019 12:19 PM
>>>>> *To:*Costello, Roger L. <costello@mitre.org>
>>>>> *Cc:*users@daffodil.apache.org
>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>> Not exactly...
>>>>> I was thinking of using DFDL to enable Drill to create a schema for data 
>>>>> that Drill cannot read.  If DFDL can be used to describe the schema, a 
>>>>> plugin could be written for Drill that mirrors this schema and ultimately 
>>>>> reads the data files.  Drill wouldn't be populating any database, but rather 
>>>>> directly querying the data.
>>>>> The advantage for this would be that it would make it very easy to enable 
>>>>> Drill to query new data types (IE simply by using a DFDL schema) and it 
>>>>> would enable users to easily query this data w/o having to load it into 
>>>>> another system.  Does that make sense?
>>>>> -- C
>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <costello@mitre.org> wrote:
>>>>>> Thanks Charles. Let me see if I understand the use case correctly.
>>>>>> Use DFDL to parse data to populate a database and then use Apache Drill to 
>>>>>> query the database.
>>>>>> Is that correct?
>>>>>> /Roger
>>>>>> *From:*Charles Givre <cgivre@gmail.com>
>>>>>> *Sent:*Wednesday, October 30, 2019 12:01 PM
>>>>>> *To:*users@daffodil.apache.org
>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill.  I think a 
>>>>>> compelling use case for DFDL would be enabling Drill to use DFDL to enable 
>>>>>> Drill to query data based on a DFDL schema.  This same concept could be 
>>>>>> applied to other SQL query engines such as Presto and/or Impala.
>>>>>> IMHO, this would facilitate the analysis of data sets supported by DFDL.
>>>>>> -- C
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <costello@mitre.org> wrote:
>>>>>>> Thanks Mike! I updated the slide:
>>>>>>> <image002.png>
>>>>>>> *From:*Beckerle, Mike <mbeckerle@tresys.com>
>>>>>>> *Sent:*Wednesday, October 30, 2019 11:45 AM
>>>>>>> *To:*users@daffodil.apache.org
>>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>>>> I would not pick on RDF data stores as the target.
>>>>>>> Parsing data to populate a database (any variety) is the actual case. The 
>>>>>>> fact that we did one project involving RDF is why I cited that example 
>>>>>>> in particular, but pulling data into any data store/database begins with 
>>>>>>> the ability to parse the data and then process it into a suitable form.
>>>>>>> This is an incomplete list, so perhaps the slide title should be "Example 
>>>>>>> Use Cases for DFDL"?
>>>>>>> ...mikeb
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> *From:*Costello, Roger L. <costello@mitre.org>
>>>>>>> *Sent:*Monday, October 28, 2019 10:41 AM
>>>>>>> *To:*users@daffodil.apache.org
>>>>>>> *Subject:*Use cases for DFDL
>>>>>>> Hi Folks,
>>>>>>> I created a slide of use cases. See below. Do you agree with the slide? 
>>>>>>> Anything you would add, delete, or change?  /Roger
>>>>>>> <image003.png>
>> 
> 


Re: Use cases for DFDL

Posted by Charles Givre <cg...@gmail.com>.
Ok... That makes sense.
Do you know if there's some documentation about that new feature?
-- C

> On Nov 7, 2019, at 1:57 PM, Paul Rogers <pa...@yahoo.com> wrote:
> 
> Hi Charles,
> 
> 
> Your suggestion to read the schema in each reader can work. In this case, the planner knows nothing about the schema; it is discovered at scan time, by each reader, as the file is read.
> 
> 
> Let's take a step back. Drill is designed for big data distributed processing. We might imagine having 100+ files of some DFDL format on HDFS, with, say, 10+ Drillbits reading those files using, say, 50 scan operators in separate threads (minor fragments).
> 
> 
> My hunch is that, since the schema is the same for all files, it would be more efficient to read the schema at plan time, then pass the schema along as part of the "physical plan" to each scan operator. That way, in the scenario above, the schema would be read once (by the planner) rather than 100 times (by each reader in each scan operator.)
> 
> 
> Further, Drill would know the types of the columns, which avoids the ambiguities that occur when types are unknown.
> 
> 
> Arina recently added schema support via a "provided schema." We passed this information to the CSV reader so it can operate with a schema. Perhaps we can look at what Arina did and figure out something similar for this use case. Or, maybe even use the DFDL schema in place of the "provided" schema. Someone will need to poke around a bit to figure out the best answer.
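> 
> For reference, a provided schema is declared with DDL roughly like this (from the Drill 1.16 schema-provisioning work; the column names are just examples and the exact syntax should be checked against the docs):
> 
>   CREATE OR REPLACE SCHEMA (`callsign` VARCHAR, `altitude` INT)
>   FOR TABLE dfs.tmp.`flights.csv`
> 
> Swapping a DFDL-derived schema in for that hand-written column list is the idea under discussion.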
> 
> 
> Thanks,
> 
> - Paul
> 
> 
> 
> On Thursday, November 7, 2019, 10:40:39 AM PST, Charles Givre <cg...@gmail.com> wrote:
> 
> 
> @Paul, 
> Do you think a format plugin is the right way to integrate this?  My thought was that we could create a folder for DFDL schemata, and then the format plugin config could specify which schema to use during the read, e.g.:
> 
> "dfdl" :{
>   "type":"dfdl",
>   "file":"myschema.dfdl",
>   "extensions":["xml"]
> }
> 
> I was envisioning this working in much the same way as other format plugins that use an external parser.
> -- C
> 
> 
> > On Nov 7, 2019, at 1:35 PM, Paul Rogers <par0328@yahoo.com.INVALID> wrote:
> > 
> > Hi All,
> > 
> > One thought to add is that if DFDL defines the file schema, then it would be ideal to use that schema at plan time as well as run time. Drill's Calcite integration provides the means to do this, though I am personally a bit hazy on the details.
> > 
> > Certainly getting the reader to work is the first step; thanks Charles for the excellent summary. Then, add the needed Calcite integration to make the schema available to the planner at plan time.
> > 
> > Thanks,
> > - Paul


Re: Use cases for DFDL

Posted by Charles Givre <cg...@gmail.com>.
Ok... That makes sense.
Do you know if there's some documentation about that new feature?
-- C

> On Nov 7, 2019, at 1:57 PM, Paul Rogers <pa...@yahoo.com> wrote:
> 
> Hi Charles,
> 
> 
> Your suggestion to read the schema in each reader can work. In this case, the planner knows nothing about the schema; it is discovered at scan time, by each reader, as the file is read.
> 
> 
> Let's take a step back. Drill is designed for big data distributed processing. We might imagine having 100+ files of some DFDL format on HDFS, with, say, 10+ Drillbits reading those files in using, say, 50 scan operators. in separate threads (minor fragments.)
> 
> 
> My hunch is that, since the schema is the same for all files, it would be more efficient to read the schema at plan time, then pass the schema along as part of the "physical plan" to each scan operator. That way, in the scenario above, the schema would be read once (by the planner) rather than 100 times (by each reader in each scan operator.)
> 
> 
> Further, Drill would know the type of the columns which can avoid ambiguities that occur when types are unknown.
> 
> 
> Arina recently added schema support via a "provided schema." We passed this information to the CSV reader so it can operate with a schema. Perhaps we can look at what Arina did and figure out something similar for this use case. Or, maybe even use the DFDL schema in place of the "provided" schema. Someone will need to poke around a bit to figure out the best answer.
> 
> 
> Thanks,
> 
> - Paul
> 
> 
> 
> On Thursday, November 7, 2019, 10:40:39 AM PST, Charles Givre <cg...@gmail.com> wrote:
> 
> 
> @Paul, 
> Do you think a format plugin is the right way to integrate this?  My thought was that we could create a folder for dfdl schemata, then the format plugin could specify which schema would be used during read.  IE:
> 
> "dfdl" :{
>   "type":"dfdl",
>   "file":"myschema.dfdl",
>   "extensions":["xml"]
> }
> 
> I was envisioning this working in much the same way as other format plugins that use an external parser.
> -- C
> 
> 
> > On Nov 7, 2019, at 1:35 PM, Paul Rogers <par0328@yahoo.com.INVALID <ma...@yahoo.com.INVALID>> wrote:
> > 
> > Hi All,
> > 
> > One thought to add is that if DFDL defines the file schema, then it would be ideal to use that schema at plan time as well as run time. Drill's Calcite integration provides means to do this, though I am personally a bit hazy on the details.
> > 
> > Certainly getting the reader to work is the first step; thanks Charles for the excellent summary. Then, add the needed Calcite integration to make the schema available to the planner at plan time.
> > 
> > Thanks,
> > - Paul
> > 
> > 
> > 
> >    On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre <cgivre@gmail.com <ma...@gmail.com>> wrote:  
> > 
> > Hi Steve, 
> > Thanks for responding... Here's how Drill reads a file:
> > 
> > Drill uses what are called "format plugins" which basically read the file in question and map fields to column vectors.  Note:  Drill supports nested data structures, so a column could contain a MAP or LIST. 
> > 
> > The basic steps are:
> > 1.  Open the inputstream and read the file
> > 2.  If the schema is known, it is advantageous to define the schema using a schemaBuilder object in advance and create schemaWriters for each column.  In this case, since we'd be using DFDL, we do know the schema so we could create the schema BEFORE the data actually gets read.  If the schema is not known in advance, JSON for instance, Drill can discover the schema as it is reading the data, by dynamically adding column vectors as data is ingested, but that's not the case here... 
> > 3.  Once the schema is defined, Drill will then read the file row by row, parse the data, and assign values to each column vector. 
> > 
> > There are a few more details but that's the essence.  
> > 
> > What would be great is if we could create a function that could directly map a DFDL schema directly to a Drill SchemaBuilder. (Docs here [1])  Drill does natively support JSON, however, it would probably be more effective and efficient if there was an InfosetOutputter custom for Drill.  Ideally, we need some sort of Iterable object so that Drill can map the parsed fields to the schema.  
> > 
> > If you want to take a look at a relatively simple format plugin take a look here: [2]. This file is the BatchReader which is where most of the heavy lifting takes place.  This plugin is for ESRI Shape files and has a mix of pre-defined fields, nested fields and fields that are defined after reading starts.
> > 
> > 
> > [1]: https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md  <https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md><https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md <https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md>>
> > [2]: https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java  <https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java><https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java <https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java>>
> > 
> > 
> > I can start a draft PR on the Drill side over the weekend and will share the link to this list.
> > Respectfully, 
> > -- C
> > 
> > 
> >> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <stephen.d.lawrence@gmail.com <ma...@gmail.com>> wrote:
> >> 
> >> I definitely agree. Apache Drill seems like a logical place to add
> >> Daffodil support. And I'm sure many of us, including myself, would be
> >> happy to provide some time towards this effort.
> >> 
> >> The Daffodil API is actually fairly simple and is usually fairly
> >> straightforward to integrate--most of the complexity comes from the DFDL
> >> schemas. There's a good "hello world" available [1] that shows more API
> >> functionality/errors/etc., but the jist of it is:
> >> 
> >> 1) Compile a DFDL schema to a data processor:
> >> 
> >>  Compiler c = Daffodil.compiler();
> >>  ProcessorFactory pf = c.compileFile(file);
> >>  DataProcessor dp = pf.onPath("/");
> >> 
> >> 2) Create an input source for the data
> >> 
> >>  InputStream is = ...
> >>  InputSourceDataInputStream in = new InputSourceDataInputStream(is);
> >> 
> >> 3) Create an infoset outputter (we have a handful of differnt kinds)
> >> 
> >>  JDOMInfosetOutputter out = new JDOMInfosetOutputter();
> >> 
> >> 4) Use the DataProcessor to parse the input data to the infoset outputter
> >> 
> >>  ParseResult pr = dataProcessor.parse(in, out)
> >> 
> >> So I guess the parts that we would need more Drill understanding is what
> >> the InfosetOutputter (step 3) needs to look like to better integrate
> >> into Drill. Is there a standard data structure that Drill expects
> >> representations of data to look like and Drill does the querying on the
> >> data structure? And is there some sort of schema that Daffodil would
> >> need to create to describe what this structure looks like so it could
> >> query it? Perhaps we'd have a custom Drill InfosetOutputter that create
> >> this data structure, unless Drill already supports XML or JSON.
> >> 
> >> Or is it completely up to the Storage Plugin (is that the right term) to
> >> determine how to take a Drill query and find the appropriate data from
> >> the data store?
> >> 
> >> - Steve
> >> 
> >> [1]
> >> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java <https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java>
> >> 
> >> 
> >> On 11/3/19 9:31 AM, Charles Givre wrote:
> >>> Hi Julian,
> >>> It seems like there is a beginning of convergence of the minds here.  I went to 
> >>> the Apache Roadshow in DC and that was where I learned about DFDL and 
> >>> immediately thought this was a really interesting possibility.
> >>> 
> >>> I'd love to see if we could foster some collaboration between the various 
> >>> projects on this.  From the Drill side of things, it would make it SO much 
> >>> easier to get Drill to read (and by extension query) various data types.  I'd be 
> >>> willing to contribute time from the Drill side, but I definitely will need help 
> >>> understanding how DFDL works.
> >>> 
> >>> --C
> >>> 
> >>> 
> >>> 
> >>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer <j.feinauer@pragmaticminds.de <ma...@pragmaticminds.de> 
> >>>> <mailto:j.feinauer@pragmaticminds.de <ma...@pragmaticminds.de>>> wrote:
> >>>> 
> >>>> Hi Charles,
> >>>> this is an interesting idea and in fact we also discussed the same matter for 
> >>>> Calcite at ApacheCon NA.
> >>>> But, I agree that it would be really powerful together with a complete Runtime 
> >>>> like Drill.
> >>>> Julian
> >>>> *Von:*Charles Givre <cgivre@gmail.com <ma...@gmail.com> <mailto:cgivre@gmail.com <ma...@gmail.com>>>
> >>>> *Antworten an:*"users@daffodil.apache.org <ma...@daffodil.apache.org> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>" 
> >>>> <users@daffodil.apache.org <ma...@daffodil.apache.org> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>>
> >>>> *Datum:*Mittwoch, 30. Oktober 2019 um 19:38
> >>>> *An:*"Costello, Roger L." <costello@mitre.org <ma...@mitre.org> <mailto:costello@mitre.org <ma...@mitre.org>>>
> >>>> *Cc:*"users@daffodil.apache.org <ma...@daffodil.apache.org> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>" 
> >>>> <users@daffodil.apache.org <ma...@daffodil.apache.org> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>>
> >>>> *Betreff:*Re: Use cases for DFDL
> >>>> +1
> >>>> 
> >>>> 
> >>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <costello@mitre.org <ma...@mitre.org> 
> >>>>> <mailto:costello@mitre.org <ma...@mitre.org>>> wrote:
> >>>>> Excellent! Okay, here’s the use case:
> >>>>> A Daffodil extension could be created for Apache Drill so that you could 
> >>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could 
> >>>>> use ANSI SQL to query the data, join it with other data, do analysis, etc., 
> >>>>> just as if it came from a database. So, instead of parsing data to XML and 
> >>>>> then using XPath to pull out data, you could instead parse data to Apache 
> >>>>> Drill's data representation and then use ANSI SQL to pull out data, and even 
> >>>>> combine it with other non-Daffodil data types. The advantage for this would 
> >>>>> be that it would make it very easy to enable Drill to query new data types 
> >>>>> (IE simply by using a DFDL schema) and it would enable users to easily query 
> >>>>> this data without having to load it into another system.
> >>>>> How’s that Charles?
> >>>>> /Roger
> >>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com> <mailto:cgivre@gmail.com <ma...@gmail.com>>>
> >>>>> *Sent:*Wednesday, October 30, 2019 2:28 PM
> >>>>> *To:*Costello, Roger L. <costello@mitre.org <ma...@mitre.org> <mailto:costello@mitre.org <ma...@mitre.org>>>
> >>>>> *Cc:*users@daffodil.apache.org <ma...@daffodil.apache.org> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>
> >>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax. It is 
> >>>>> regular ANSI SQL.  IMHO, I think this. would be a really great collaboration 
> >>>>> of the two communities.
> >>>>> --C
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <costello@mitre.org <ma...@mitre.org> 
> >>>>>> <mailto:costello@mitre.org <ma...@mitre.org>>> wrote:
> >>>>>> Thanks again Charles. Is the following use case description correct?
> >>>>>> A Daffodil extension could be created for Apache Drill so that you could 
> >>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could 
> >>>>>> use Apache Drill's query-like syntax and rich capabilities to query parts of 
> >>>>>> that data, join it with other data, do analysis, etc., just as if it came 
> >>>>>> from a database. So, instead of parsing data to XML and then using XPath to 
> >>>>>> pull out data, you could instead parse data to Apache Drill's data 
> >>>>>> representation and then use Drills rich data-query capabilities to pull out 
> >>>>>> data, and even combine it with other non-Daffodil data types. The advantage 
> >>>>>> for this would be that it would make it very easy to enable Drill to query 
> >>>>>> new data types (IE simply by using a DFDL schema) and it would enable users 
> >>>>>> to easily query this data without having to load it into another system.
> >>>>>> Is that correct?
> >>>>>> /Roger
> >>>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com> <mailto:cgivre@gmail.com <ma...@gmail.com>>>
> >>>>>> *Sent:*Wednesday, October 30, 2019 12:19 PM
> >>>>>> *To:*Costello, Roger L. <costello@mitre.org <ma...@mitre.org> <mailto:costello@mitre.org <ma...@mitre.org>>>
> >>>>>> *Cc:*users@daffodil.apache.org <ma...@daffodil.apache.org> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>
> >>>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>>> Not exactly...
> >>>>>> I was thinking of using DFDL to enable Drill to create a schema for data 
> >>>>>> that Drill cannot read.  If DFDL can be used to describe the schema, a 
> >>>>>> plugin could be written for Drill that mirrors this schema and ultimately 
> >>>>>> reads the data files.  Drill wouldn't be populating any database, but rather 
> >>>>>> directly querying the data.
> >>>>>> The advantage for this would be that it would make it very easy to enable 
> >>>>>> Drill to query new data types (IE simply by using a DFDL schema) and it 
> >>>>>> would enable users to easily query this data w/o having to load it into 
> >>>>>> another system.  Does that make sense?
> >>>>>> -- C
> >>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <costello@mitre.org <ma...@mitre.org> 
> >>>>>>> <mailto:costello@mitre.org <ma...@mitre.org>>> wrote:
> >>>>>>> Thanks Charles. Let me see if I understand the use case correctly.
> >>>>>>> Use DFDL to parse data to populate a database and then use Apache Drill to 
> >>>>>>> query the database.
> >>>>>>> Is that correct?
> >>>>>>> /Roger
> >>>>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com> <mailto:cgivre@gmail.com <ma...@gmail.com>>>
> >>>>>>> *Sent:*Wednesday, October 30, 2019 12:01 PM
> >>>>>>> *To:*users@daffodil.apache.org <ma...@daffodil.apache.org> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>
> >>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill.  I think a 
> >>>>>>> compelling use case for DFDL would be enabling Drill to use DFDL to enable 
> >>>>>>> Drill to query data based on a DFDL schema.  This same concept could be 
> >>>>>>> applied to other SQL query engines such as Presto and/or Impala.
> >>>>>>> IMHO, this would facilitate the analysis of data sets supported by DFDL.
> >>>>>>> -- C
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <costello@mitre.org <ma...@mitre.org> 
> >>>>>>>> <mailto:costello@mitre.org <ma...@mitre.org>>> wrote:
> >>>>>>>> Thanks Mike! I updated the slide:
> >>>>>>>> <image002.png>
> >>>>>>>> *From:*Beckerle, Mike <mbeckerle@tresys.com <ma...@tresys.com> <mailto:mbeckerle@tresys.com <ma...@tresys.com>>>
> >>>>>>>> *Sent:*Wednesday, October 30, 2019 11:45 AM
> >>>>>>>> *To:*users@daffodil.apache.org <ma...@daffodil.apache.org> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>
> >>>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>>>>> I would not pick on RDF data stores as the target.
> >>>>>>>> Parsing data to populate a database (any variety) is the actual case. The 
> >>>>>>>> fact that we did do one project involving RDF is why I cited that example 
> >>>>>>>> in particular but pulling data into any data store/data base begins with 
> >>>>>>>> the ability to parse the data, and then process it into suitable form.
> >>>>>>>> This is an incomplete list so perhaps this slide title should be "Example 
> >>>>>>>> Use Cases for DFDL" ?
> >>>>>>>> ...mikeb
> >>>>>>>> --------------------------------------------------------------------------------
> >>>>>>>> *From:*Costello, Roger L. <costello@mitre.org <ma...@mitre.org> <mailto:costello@mitre.org <ma...@mitre.org>>>
> >>>>>>>> *Sent:*Monday, October 28, 2019 10:41 AM
> >>>>>>>> *To:*users@daffodil.apache.org <ma...@daffodil.apache.org> 
> >>>>>>>> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>><users@daffodil.apache.org <ma...@daffodil.apache.org> 
> >>>>>>>> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>>
> >>>>>>>> *Subject:*Use cases for DFDL
> >>>>>>>> Hi Folks,
> >>>>>>>> I created a slide of use cases. See below. Do you agree with the slide? 
> >>>>>>>> Anything you would add, delete, or change?  /Roger
> >>>>>>>> <image003.png>
> >>> 
> >> 


Re: Use cases for DFDL

Posted by Charles Givre <cg...@gmail.com>.
Ok... That makes sense.
Do you know if there's some documentation about that new feature?
-- C

> On Nov 7, 2019, at 1:57 PM, Paul Rogers <pa...@yahoo.com> wrote:
> 
> Hi Charles,
> 
> 
> Your suggestion to read the schema in each reader can work. In this case, the planner knows nothing about the schema; it is discovered at scan time, by each reader, as the file is read.
> 
> 
> Let's take a step back. Drill is designed for big data distributed processing. We might imagine having 100+ files of some DFDL format on HDFS, with, say, 10+ Drillbits reading those files in using, say, 50 scan operators. in separate threads (minor fragments.)
> 
> 
> My hunch is that, since the schema is the same for all files, it would be more efficient to read the schema at plan time, then pass the schema along as part of the "physical plan" to each scan operator. That way, in the scenario above, the schema would be read once (by the planner) rather than 100 times (by each reader in each scan operator.)
> 
> 
> Further, Drill would know the type of the columns which can avoid ambiguities that occur when types are unknown.
> 
> 
> Arina recently added schema support via a "provided schema." We passed this information to the CSV reader so it can operate with a schema. Perhaps we can look at what Arina did and figure out something similar for this use case. Or, maybe even use the DFDL schema in place of the "provided" schema. Someone will need to poke around a bit to figure out the best answer.
> 
> 
> Thanks,
> 
> - Paul
> 
> 
> 
> On Thursday, November 7, 2019, 10:40:39 AM PST, Charles Givre <cg...@gmail.com> wrote:
> 
> 
> @Paul, 
> Do you think a format plugin is the right way to integrate this?  My thought was that we could create a folder for dfdl schemata, then the format plugin could specify which schema would be used during read.  IE:
> 
> "dfdl" :{
>   "type":"dfdl",
>   "file":"myschema.dfdl",
>   "extensions":["xml"]
> }
> 
> I was envisioning this working in much the same way as other format plugins that use an external parser.
> -- C
> 
> 
> > On Nov 7, 2019, at 1:35 PM, Paul Rogers <par0328@yahoo.com.INVALID <ma...@yahoo.com.INVALID>> wrote:
> > 
> > Hi All,
> > 
> > One thought to add is that if DFDL defines the file schema, then it would be ideal to use that schema at plan time as well as run time. Drill's Calcite integration provides means to do this, though I am personally a bit hazy on the details.
> > 
> > Certainly getting the reader to work is the first step; thanks Charles for the excellent summary. Then, add the needed Calcite integration to make the schema available to the planner at plan time.
> > 
> > Thanks,
> > - Paul
> > 
> > 
> > 
> >    On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre <cgivre@gmail.com <ma...@gmail.com>> wrote:  
> > 
> > Hi Steve, 
> > Thanks for responding... Here's how Drill reads a file:
> > 
> > Drill uses what are called "format plugins" which basically read the file in question and map fields to column vectors.  Note:  Drill supports nested data structures, so a column could contain a MAP or LIST. 
> > 
> > The basic steps are:
> > 1.  Open the inputstream and read the file
> > 2.  If the schema is known, it is advantageous to define the schema using a schemaBuilder object in advance and create schemaWriters for each column.  In this case, since we'd be using DFDL, we do know the schema so we could create the schema BEFORE the data actually gets read.  If the schema is not known in advance, JSON for instance, Drill can discover the schema as it is reading the data, by dynamically adding column vectors as data is ingested, but that's not the case here... 
> > 3.  Once the schema is defined, Drill will then read the file row by row, parse the data, and assign values to each column vector. 
> > 
> > There are a few more details but that's the essence.  
> > 
> > What would be great is if we could create a function that could directly map a DFDL schema directly to a Drill SchemaBuilder. (Docs here [1])  Drill does natively support JSON, however, it would probably be more effective and efficient if there was an InfosetOutputter custom for Drill.  Ideally, we need some sort of Iterable object so that Drill can map the parsed fields to the schema.  
> > 
> > If you want to take a look at a relatively simple format plugin take a look here: [2]. This file is the BatchReader which is where most of the heavy lifting takes place.  This plugin is for ESRI Shape files and has a mix of pre-defined fields, nested fields and fields that are defined after reading starts.
> > 
> > 
> > [1]: https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md  <https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md><https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md <https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md>>
> > [2]: https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java  <https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java><https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java <https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java>>
> > 
> > 
> > I can start a draft PR on the Drill side over the weekend and will share the link to this list.
> > Respectfully, 
> > -- C
> > 
> > 
> >> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <stephen.d.lawrence@gmail.com <ma...@gmail.com>> wrote:
> >> 
> >> I definitely agree. Apache Drill seems like a logical place to add
> >> Daffodil support. And I'm sure many of us, including myself, would be
> >> happy to provide some time towards this effort.
> >> 
> >> The Daffodil API is actually fairly simple and is usually fairly
> >> straightforward to integrate--most of the complexity comes from the DFDL
> >> schemas. There's a good "hello world" available [1] that shows more API
> >> functionality/errors/etc., but the jist of it is:
> >> 
> >> 1) Compile a DFDL schema to a data processor:
> >> 
> >>  Compiler c = Daffodil.compiler();
> >>  ProcessorFactory pf = c.compileFile(file);
> >>  DataProcessor dp = pf.onPath("/");
> >> 
> >> 2) Create an input source for the data
> >> 
> >>  InputStream is = ...
> >>  InputSourceDataInputStream in = new InputSourceDataInputStream(is);
> >> 
> >> 3) Create an infoset outputter (we have a handful of differnt kinds)
> >> 
> >>  JDOMInfosetOutputter out = new JDOMInfosetOutputter();
> >> 
> >> 4) Use the DataProcessor to parse the input data to the infoset outputter
> >> 
> >>  ParseResult pr = dataProcessor.parse(in, out)
> >> 
> >> So I guess the parts that we would need more Drill understanding is what
> >> the InfosetOutputter (step 3) needs to look like to better integrate
> >> into Drill. Is there a standard data structure that Drill expects
> >> representations of data to look like and Drill does the querying on the
> >> data structure? And is there some sort of schema that Daffodil would
> >> need to create to describe what this structure looks like so it could
> >> query it? Perhaps we'd have a custom Drill InfosetOutputter that create
> >> this data structure, unless Drill already supports XML or JSON.
> >> 
> >> Or is it completely up to the Storage Plugin (is that the right term) to
> >> determine how to take a Drill query and find the appropriate data from
> >> the data store?
> >> 
> >> - Steve
> >> 
> >> [1]
> >> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java <https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java>
> >> 
> >> 
> >> On 11/3/19 9:31 AM, Charles Givre wrote:
> >>> Hi Julian,
> >>> It seems like there is a beginning of convergence of the minds here.  I went to 
> >>> the Apache Roadshow in DC and that was where I learned about DFDL and 
> >>> immediately thought this was a really interesting possibility.
> >>> 
> >>> I'd love to see if we could foster some collaboration between the various 
> >>> projects on this.  From the Drill side of things, it would make it SO much 
> >>> easier to get Drill to read (and by extension query) various data types.  I'd be 
> >>> willing to contribute time from the Drill side, but I definitely will need help 
> >>> understanding how DFDL works.
> >>> 
> >>> --C
> >>> 
> >>> 
> >>> 
> >>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer <j.feinauer@pragmaticminds.de <ma...@pragmaticminds.de> 
> >>>> <mailto:j.feinauer@pragmaticminds.de <ma...@pragmaticminds.de>>> wrote:
> >>>> 
> >>>> Hi Charles,
> >>>> this is an interesting idea and in fact we also discussed the same matter for 
> >>>> Calcite at ApacheCon NA.
> >>>> But, I agree that it would be really powerful together with a complete Runtime 
> >>>> like Drill.
> >>>> Julian
> >>>> *Von:*Charles Givre <cgivre@gmail.com <ma...@gmail.com> <mailto:cgivre@gmail.com <ma...@gmail.com>>>
> >>>> *Antworten an:*"users@daffodil.apache.org <ma...@daffodil.apache.org> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>" 
> >>>> <users@daffodil.apache.org <ma...@daffodil.apache.org> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>>
> >>>> *Datum:*Mittwoch, 30. Oktober 2019 um 19:38
> >>>> *An:*"Costello, Roger L." <costello@mitre.org <ma...@mitre.org> <mailto:costello@mitre.org <ma...@mitre.org>>>
> >>>> *Cc:*"users@daffodil.apache.org <ma...@daffodil.apache.org> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>" 
> >>>> <users@daffodil.apache.org <ma...@daffodil.apache.org> <mailto:users@daffodil.apache.org <ma...@daffodil.apache.org>>>
> >>>> *Betreff:*Re: Use cases for DFDL
> >>>> +1
> >>>> 
> >>>> 
> >>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <costello@mitre.org> wrote:
> >>>>> Excellent! Okay, here’s the use case:
> >>>>> A Daffodil extension could be created for Apache Drill so that you could 
> >>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could 
> >>>>> use ANSI SQL to query the data, join it with other data, do analysis, etc., 
> >>>>> just as if it came from a database. So, instead of parsing data to XML and 
> >>>>> then using XPath to pull out data, you could instead parse data to Apache 
> >>>>> Drill's data representation and then use ANSI SQL to pull out data, and even 
> >>>>> combine it with other non-Daffodil data types. The advantage for this would 
> >>>>> be that it would make it very easy to enable Drill to query new data types 
> >>>>> (i.e., simply by using a DFDL schema) and it would enable users to easily query 
> >>>>> this data without having to load it into another system.
> >>>>> How’s that Charles?
> >>>>> /Roger
> >>>>> *From:* Charles Givre <cgivre@gmail.com>
> >>>>> *Sent:* Wednesday, October 30, 2019 2:28 PM
> >>>>> *To:* Costello, Roger L. <costello@mitre.org>
> >>>>> *Cc:* users@daffodil.apache.org
> >>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax. It is 
> >>>>> regular ANSI SQL.  IMHO, I think this would be a really great collaboration 
> >>>>> of the two communities.
> >>>>> --C
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <costello@mitre.org> wrote:
> >>>>>> Thanks again Charles. Is the following use case description correct?
> >>>>>> A Daffodil extension could be created for Apache Drill so that you could 
> >>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could 
> >>>>>> use Apache Drill's query-like syntax and rich capabilities to query parts of 
> >>>>>> that data, join it with other data, do analysis, etc., just as if it came 
> >>>>>> from a database. So, instead of parsing data to XML and then using XPath to 
> >>>>>> pull out data, you could instead parse data to Apache Drill's data 
> >>>>>> representation and then use Drill's rich data-query capabilities to pull out 
> >>>>>> data, and even combine it with other non-Daffodil data types. The advantage 
> >>>>>> for this would be that it would make it very easy to enable Drill to query 
> >>>>>> new data types (i.e., simply by using a DFDL schema) and it would enable users 
> >>>>>> to easily query this data without having to load it into another system.
> >>>>>> Is that correct?
> >>>>>> /Roger
> >>>>>> *From:* Charles Givre <cgivre@gmail.com>
> >>>>>> *Sent:* Wednesday, October 30, 2019 12:19 PM
> >>>>>> *To:* Costello, Roger L. <costello@mitre.org>
> >>>>>> *Cc:* users@daffodil.apache.org
> >>>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>>> Not exactly...
> >>>>>> I was thinking of using DFDL to enable Drill to create a schema for data 
> >>>>>> that Drill cannot read.  If DFDL can be used to describe the schema, a 
> >>>>>> plugin could be written for Drill that mirrors this schema and ultimately 
> >>>>>> reads the data files.  Drill wouldn't be populating any database, but rather 
> >>>>>> directly querying the data.
> >>>>>> The advantage for this would be that it would make it very easy to enable 
> >>>>>> Drill to query new data types (i.e., simply by using a DFDL schema) and it 
> >>>>>> would enable users to easily query this data w/o having to load it into 
> >>>>>> another system.  Does that make sense?
> >>>>>> -- C
> >>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <costello@mitre.org> wrote:
> >>>>>>> Thanks Charles. Let me see if I understand the use case correctly.
> >>>>>>> Use DFDL to parse data to populate a database and then use Apache Drill to 
> >>>>>>> query the database.
> >>>>>>> Is that correct?
> >>>>>>> /Roger
> >>>>>>> *From:* Charles Givre <cgivre@gmail.com>
> >>>>>>> *Sent:* Wednesday, October 30, 2019 12:01 PM
> >>>>>>> *To:* users@daffodil.apache.org
> >>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill.  I think a 
> >>>>>>> compelling use case for DFDL would be enabling Drill to query data based on 
> >>>>>>> a DFDL schema.  This same concept could be 
> >>>>>>> applied to other SQL query engines such as Presto and/or Impala.
> >>>>>>> IMHO, this would facilitate the analysis of data sets supported by DFDL.
> >>>>>>> -- C
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <costello@mitre.org> wrote:
> >>>>>>>> Thanks Mike! I updated the slide:
> >>>>>>>> <image002.png>
> >>>>>>>> *From:* Beckerle, Mike <mbeckerle@tresys.com>
> >>>>>>>> *Sent:* Wednesday, October 30, 2019 11:45 AM
> >>>>>>>> *To:* users@daffodil.apache.org
> >>>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>>>>> I would not pick on RDF data stores as the target.
> >>>>>>>> Parsing data to populate a database (any variety) is the actual case. The 
> >>>>>>>> fact that we did do one project involving RDF is why I cited that example 
> >>>>>>>> in particular, but pulling data into any data store/database begins with 
> >>>>>>>> the ability to parse the data, and then process it into suitable form.
> >>>>>>>> This is an incomplete list so perhaps this slide title should be "Example 
> >>>>>>>> Use Cases for DFDL" ?
> >>>>>>>> ...mikeb
> >>>>>>>> --------------------------------------------------------------------------------
> >>>>>>>> *From:* Costello, Roger L. <costello@mitre.org>
> >>>>>>>> *Sent:* Monday, October 28, 2019 10:41 AM
> >>>>>>>> *To:* users@daffodil.apache.org
> >>>>>>>> *Subject:*Use cases for DFDL
> >>>>>>>> Hi Folks,
> >>>>>>>> I created a slide of use cases. See below. Do you agree with the slide? 
> >>>>>>>> Anything you would add, delete, or change?  /Roger
> >>>>>>>> <image003.png>
> >>> 
> >> 


Re: Use cases for DFDL

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Charles,

Your suggestion to read the schema in each reader can work. In this case, the planner knows nothing about the schema; it is discovered at scan time, by each reader, as the file is read.


Let's take a step back. Drill is designed for distributed big data processing. We might imagine having 100+ files of some DFDL format on HDFS, with, say, 10+ Drillbits reading those files using, say, 50 scan operators in separate threads (minor fragments).

My hunch is that, since the schema is the same for all files, it would be more efficient to read the schema at plan time, then pass the schema along as part of the "physical plan" to each scan operator. That way, in the scenario above, the schema would be read once (by the planner) rather than 100 times (by each reader in each scan operator).

Further, Drill would know the type of the columns which can avoid ambiguities that occur when types are unknown.

Arina recently added schema support via a "provided schema." We passed this information to the CSV reader so it can operate with a schema. Perhaps we can look at what Arina did and figure out something similar for this use case. Or, maybe even use the DFDL schema in place of the "provided" schema. Someone will need to poke around a bit to figure out the best answer.
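The glue needed for that is something that walks a compiled DFDL schema and emits Drill column metadata. A very rough sketch follows; it assumes we can pull top-level element names and their XSD simple types out of the DFDL schema by some means, the class and the type mapping are illustrative rather than an existing Daffodil or Drill API, and package locations may differ by Drill version:

  import java.util.Map;
  import org.apache.drill.common.types.TypeProtos.MinorType;
  import org.apache.drill.exec.record.metadata.SchemaBuilder;
  import org.apache.drill.exec.record.metadata.TupleMetadata;

  public class DfdlSchemaMapper {  // hypothetical helper, not part of Drill or Daffodil

    // Map a few XSD simple types to Drill minor types; real code would cover the full set.
    private static MinorType drillType(String xsdType) {
      switch (xsdType) {
        case "xs:int":
        case "xs:integer": return MinorType.INT;
        case "xs:long":    return MinorType.BIGINT;
        case "xs:double":  return MinorType.FLOAT8;
        default:           return MinorType.VARCHAR;
      }
    }

    // columns: top-level element name -> XSD simple type, as extracted from the DFDL schema
    public static TupleMetadata toDrillSchema(Map<String, String> columns) {
      SchemaBuilder builder = new SchemaBuilder();
      columns.forEach((name, xsdType) -> builder.addNullable(name, drillType(xsdType)));
      return builder.buildSchema();
    }
  }

The resulting TupleMetadata is the kind of object the provided-schema machinery and the row set framework's schema writers consume, so it could be computed once at plan time and shipped to each scan operator.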

Thanks,
- Paul

 

    On Thursday, November 7, 2019, 10:40:39 AM PST, Charles Givre <cg...@gmail.com> wrote:  
 
 @Paul, 
Do you think a format plugin is the right way to integrate this?  My thought was that we could create a folder for dfdl schemata, then the format plugin could specify which schema would be used during read.  IE:

"dfdl" :{
  "type":"dfdl",
  "file":"myschema.dfdl",
  "extensions":["xml"]
}

I was envisioning this working in much the same way as other format plugins that use an external parser.
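On the Drill side, a config block like that would presumably be backed by a small Jackson-annotated format config class. A sketch under those assumptions (the class name and fields are illustrative, not an existing plugin, and the equals()/hashCode() plumbing Drill expects on format configs is omitted):

  import java.util.Collections;
  import java.util.List;
  import com.fasterxml.jackson.annotation.JsonTypeName;
  import org.apache.drill.common.logical.FormatPluginConfig;

  @JsonTypeName("dfdl")  // matches the "type" field in the JSON block above
  public class DfdlFormatConfig implements FormatPluginConfig {
    // DFDL schema to compile when the reader is set up, resolved against the schemata folder
    public String file = "myschema.dfdl";
    // file extensions this format plugin should claim
    public List<String> extensions = Collections.singletonList("xml");
  }

The accompanying format plugin and batch reader would then compile the named DFDL schema once and run a Daffodil parse per file, much like other format plugins that wrap an external parser.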
-- C


> On Nov 7, 2019, at 1:35 PM, Paul Rogers <pa...@yahoo.com.INVALID> wrote:
> 
> Hi All,
> 
> One thought to add is that if DFDL defines the file schema, then it would be ideal to use that schema at plan time as well as run time. Drill's Calcite integration provides means to do this, though I am personally a bit hazy on the details.
> 
> Certainly getting the reader to work is the first step; thanks Charles for the excellent summary. Then, add the needed Calcite integration to make the schema available to the planner at plan time.
> 
> Thanks,
> - Paul
> 
> 
> 
  

Re: Use cases for DFDL

Posted by Charles Givre <cg...@gmail.com>.
@Paul, 
Do you think a format plugin is the right way to integrate this?  My thought was that we could create a folder for dfdl schemata, then the format plugin could specify which schema would be used during read.  IE:

"dfdl" :{
  "type":"dfdl",
  "file":"myschema.dfdl",
  "extensions":["xml"]
}

I was envisioning this working in much the same way as other format plugins that use an external parser.
-- C


> On Nov 7, 2019, at 1:35 PM, Paul Rogers <pa...@yahoo.com.INVALID> wrote:
> 
> Hi All,
> 
> One thought to add is that if DFDL defines the file schema, then it would be ideal to use that schema at plan time as well as run time. Drill's Calcite integration provides means to do this, though I am personally a bit hazy on the details.
> 
> Certainly getting the reader to work is the first step; thanks Charles for the excellent summary. Then, add the needed Calcite integration to make the schema available to the planner at plan time.
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre <cg...@gmail.com> wrote:  
> 
> Hi Steve, 
> Thanks for responding... Here's how Drill reads a file:
> 
> Drill uses what are called "format plugins" which basically read the file in question and map fields to column vectors.  Note:  Drill supports nested data structures, so a column could contain a MAP or LIST. 
> 
> The basic steps are:
> 1.  Open the inputstream and read the file
> 2.  If the schema is known, it is advantageous to define the schema using a schemaBuilder object in advance and create schemaWriters for each column.  In this case, since we'd be using DFDL, we do know the schema so we could create the schema BEFORE the data actually gets read.  If the schema is not known in advance, JSON for instance, Drill can discover the schema as it is reading the data, by dynamically adding column vectors as data is ingested, but that's not the case here... 
> 3.  Once the schema is defined, Drill will then read the file row by row, parse the data, and assign values to each column vector. 
> 
> There are a few more details but that's the essence.  
> 
> What would be great is if we could create a function that could directly map a DFDL schema directly to a Drill SchemaBuilder. (Docs here [1])  Drill does natively support JSON, however, it would probably be more effective and efficient if there was an InfosetOutputter custom for Drill.  Ideally, we need some sort of Iterable object so that Drill can map the parsed fields to the schema.  
> 
> If you want to take a look at a relatively simple format plugin take a look here: [2]. This file is the BatchReader which is where most of the heavy lifting takes place.  This plugin is for ESRI Shape files and has a mix of pre-defined fields, nested fields and fields that are defined after reading starts.
> 
> 
> [1]: https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md <https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md>
> [2]: https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java <https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java>
> 
> 
> I can start a draft PR on the Drill side over the weekend and will share the link to this list.
> Respectfully, 
> -- C
> 
> 
>> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <st...@gmail.com> wrote:
>> 
>> I definitely agree. Apache Drill seems like a logical place to add
>> Daffodil support. And I'm sure many of us, including myself, would be
>> happy to provide some time towards this effort.
>> 
>> The Daffodil API is actually fairly simple and is usually fairly
>> straightforward to integrate--most of the complexity comes from the DFDL
>> schemas. There's a good "hello world" available [1] that shows more API
>> functionality/errors/etc., but the jist of it is:
>> 
>> 1) Compile a DFDL schema to a data processor:
>> 
>>   Compiler c = Daffodil.compiler();
>>   ProcessorFactory pf = c.compileFile(file);
>>   DataProcessor dp = pf.onPath("/");
>> 
>> 2) Create an input source for the data
>> 
>>   InputStream is = ...
>>   InputSourceDataInputStream in = new InputSourceDataInputStream(is);
>> 
>> 3) Create an infoset outputter (we have a handful of differnt kinds)
>> 
>>   JDOMInfosetOutputter out = new JDOMInfosetOutputter();
>> 
>> 4) Use the DataProcessor to parse the input data to the infoset outputter
>> 
>>   ParseResult pr = dataProcessor.parse(in, out)
>> 
>> So I guess the parts that we would need more Drill understanding is what
>> the InfosetOutputter (step 3) needs to look like to better integrate
>> into Drill. Is there a standard data structure that Drill expects
>> representations of data to look like and Drill does the querying on the
>> data structure? And is there some sort of schema that Daffodil would
>> need to create to describe what this structure looks like so it could
>> query it? Perhaps we'd have a custom Drill InfosetOutputter that create
>> this data structure, unless Drill already supports XML or JSON.
>> 
>> Or is it completely up to the Storage Plugin (is that the right term) to
>> determine how to take a Drill query and find the appropriate data from
>> the data store?
>> 
>> - Steve
>> 
>> [1]
>> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java
>> 
>> 
>> On 11/3/19 9:31 AM, Charles Givre wrote:
>>> Hi Julian,
>>> It seems like minds are beginning to converge here.  I went to 
>>> the Apache Roadshow in DC, which is where I learned about DFDL and 
>>> immediately thought this was a really interesting possibility.
>>> 
>>> I'd love to see if we could foster some collaboration between the various 
>>> projects on this.  From the Drill side of things, it would make it SO much 
>>> easier to get Drill to read (and by extension query) various data types.  I'd be 
>>> willing to contribute time from the Drill side, but I definitely will need help 
>>> understanding how DFDL works.
>>> 
>>> --C
>>> 
>>> 
>>> 
>>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer <j.feinauer@pragmaticminds.de 
>>>> <ma...@pragmaticminds.de>> wrote:
>>>> 
>>>> Hi Charles,
>>>> this is an interesting idea, and in fact we also discussed the same matter for 
>>>> Calcite at ApacheCon NA.
>>>> But I agree that it would be really powerful together with a complete runtime 
>>>> like Drill.
>>>> Julian
>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com>>
>>>> *Reply-To:*"users@daffodil.apache.org <ma...@daffodil.apache.org>" 
>>>> <users@daffodil.apache.org <ma...@daffodil.apache.org>>
>>>> *Date:*Wednesday, October 30, 2019 7:38 PM
>>>> *To:*"Costello, Roger L." <costello@mitre.org <ma...@mitre.org>>
>>>> *Cc:*"users@daffodil.apache.org <ma...@daffodil.apache.org>" 
>>>> <users@daffodil.apache.org <ma...@daffodil.apache.org>>
>>>> *Subject:*Re: Use cases for DFDL
>>>> +1
>>>> 
>>>> 
>>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <costello@mitre.org 
>>>>> <ma...@mitre.org>> wrote:
>>>>> Excellent! Okay, here’s the use case:
>>>>> A Daffodil extension could be created for Apache Drill so that you could 
>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could 
>>>>> use ANSI SQL to query the data, join it with other data, do analysis, etc., 
>>>>> just as if it came from a database. So, instead of parsing data to XML and 
>>>>> then using XPath to pull out data, you could instead parse data to Apache 
>>>>> Drill's data representation and then use ANSI SQL to pull out data, and even 
>>>>> combine it with other non-Daffodil data types. The advantage for this would 
>>>>> be that it would make it very easy to enable Drill to query new data types 
>>>>> (IE simply by using a DFDL schema) and it would enable users to easily query 
>>>>> this data without having to load it into another system.
>>>>> How’s that Charles?
>>>>> /Roger
>>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com>>
>>>>> *Sent:*Wednesday, October 30, 2019 2:28 PM
>>>>> *To:*Costello, Roger L. <costello@mitre.org <ma...@mitre.org>>
>>>>> *Cc:*users@daffodil.apache.org <ma...@daffodil.apache.org>
>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax. It is 
>>>>> regular ANSI SQL.  IMHO, this would be a really great collaboration 
>>>>> of the two communities.
>>>>> --C
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <costello@mitre.org 
>>>>>> <ma...@mitre.org>> wrote:
>>>>>> Thanks again Charles. Is the following use case description correct?
>>>>>> A Daffodil extension could be created for Apache Drill so that you could 
>>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could 
>>>>>> use Apache Drill's query-like syntax and rich capabilities to query parts of 
>>>>>> that data, join it with other data, do analysis, etc., just as if it came 
>>>>>> from a database. So, instead of parsing data to XML and then using XPath to 
>>>>>> pull out data, you could instead parse data to Apache Drill's data 
>>>>>> representation and then use Drill's rich data-query capabilities to pull out 
>>>>>> data, and even combine it with other non-Daffodil data types. The advantage 
>>>>>> for this would be that it would make it very easy to enable Drill to query 
>>>>>> new data types (IE simply by using a DFDL schema) and it would enable users 
>>>>>> to easily query this data without having to load it into another system.
>>>>>> Is that correct?
>>>>>> /Roger
>>>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com>>
>>>>>> *Sent:*Wednesday, October 30, 2019 12:19 PM
>>>>>> *To:*Costello, Roger L. <costello@mitre.org <ma...@mitre.org>>
>>>>>> *Cc:*users@daffodil.apache.org <ma...@daffodil.apache.org>
>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>>> Not exactly...
>>>>>> I was thinking of using DFDL to enable Drill to create a schema for data 
>>>>>> that Drill cannot read.  If DFDL can be used to describe the schema, a 
>>>>>> plugin could be written for Drill that mirrors this schema and ultimately 
>>>>>> reads the data files.  Drill wouldn't be populating any database, but rather 
>>>>>> directly querying the data.
>>>>>> The advantage for this would be that it would make it very easy to enable 
>>>>>> Drill to query new data types (IE simply by using a DFDL schema) and it 
>>>>>> would enable users to easily query this data w/o having to load it into 
>>>>>> another system.  Does that make sense?
>>>>>> -- C
>>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <costello@mitre.org 
>>>>>>> <ma...@mitre.org>> wrote:
>>>>>>> Thanks Charles. Let me see if I understand the use case correctly.
>>>>>>> Use DFDL to parse data to populate a database and then use Apache Drill to 
>>>>>>> query the database.
>>>>>>> Is that correct?
>>>>>>> /Roger
>>>>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com>>
>>>>>>> *Sent:*Wednesday, October 30, 2019 12:01 PM
>>>>>>> *To:*users@daffodil.apache.org <ma...@daffodil.apache.org>
>>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill.  I think a 
>>>>>>> compelling use case for DFDL would be enabling Drill to query data 
>>>>>>> based on a DFDL schema.  This same concept could be 
>>>>>>> applied to other SQL query engines such as Presto and/or Impala.
>>>>>>> IMHO, this would facilitate the analysis of data sets supported by DFDL.
>>>>>>> -- C
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <costello@mitre.org 
>>>>>>>> <ma...@mitre.org>> wrote:
>>>>>>>> Thanks Mike! I updated the slide:
>>>>>>>> <image002.png>
>>>>>>>> *From:*Beckerle, Mike <mbeckerle@tresys.com <ma...@tresys.com>>
>>>>>>>> *Sent:*Wednesday, October 30, 2019 11:45 AM
>>>>>>>> *To:*users@daffodil.apache.org <ma...@daffodil.apache.org>
>>>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>>>>> I would not pick on RDF data stores as the target.
>>>>>>>> Parsing data to populate a database (of any variety) is the actual use case. The 
>>>>>>>> fact that we did one project involving RDF is why I cited that example 
>>>>>>>> in particular, but pulling data into any data store/database begins with 
>>>>>>>> the ability to parse the data and then process it into a suitable form.
>>>>>>>> This is an incomplete list, so perhaps this slide's title should be "Example 
>>>>>>>> Use Cases for DFDL"?
>>>>>>>> ...mikeb
>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>> *From:*Costello, Roger L. <costello@mitre.org <ma...@mitre.org>>
>>>>>>>> *Sent:*Monday, October 28, 2019 10:41 AM
>>>>>>>> *To:*users@daffodil.apache.org 
>>>>>>>> <ma...@daffodil.apache.org><users@daffodil.apache.org 
>>>>>>>> <ma...@daffodil.apache.org>>
>>>>>>>> *Subject:*Use cases for DFDL
>>>>>>>> Hi Folks,
>>>>>>>> I created a slide of use cases. See below. Do you agree with the slide? 
>>>>>>>> Anything you would add, delete, or change?  /Roger
>>>>>>>> <image003.png>
>>> 
>> 


Re: Use cases for DFDL

Posted by Charles Givre <cg...@gmail.com>.
@Paul, 
Do you think a format plugin is the right way to integrate this?  My thought was that we could create a folder for DFDL schemata; the format plugin could then specify which schema to use during the read.  IE:

"dfdl" :{
  "type":"dfdl",
  "file":"myschema.dfdl",
  "extensions":["xml"]
}

I was envisioning this working in much the same way as other format plugins that use an external parser.
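A config block like that would be backed by a Jackson-annotated class on the Drill side. Here is a minimal sketch; the class and its properties are hypothetical (there is no such plugin yet), but they follow the usual FormatPluginConfig pattern:

    import java.util.Collections;
    import java.util.List;
    import com.fasterxml.jackson.annotation.JsonCreator;
    import com.fasterxml.jackson.annotation.JsonProperty;
    import com.fasterxml.jackson.annotation.JsonTypeName;
    import org.apache.drill.common.logical.FormatPluginConfig;

    // Hypothetical config backing the "dfdl" JSON block above.
    @JsonTypeName("dfdl")
    public class DfdlFormatConfig implements FormatPluginConfig {
      private final String file;              // DFDL schema to use for this format
      private final List<String> extensions;  // file extensions handled by the plugin

      @JsonCreator
      public DfdlFormatConfig(@JsonProperty("file") String file,
                              @JsonProperty("extensions") List<String> extensions) {
        this.file = file;
        this.extensions = extensions == null ? Collections.emptyList() : extensions;
      }

      public String getFile() { return file; }
      public List<String> getExtensions() { return extensions; }

      // equals()/hashCode() omitted; real Drill format configs implement them
      // so the planner can compare plugin configurations.
    }

Drill would deserialize this from the storage configuration, and the format plugin could then compile the named DFDL schema once and reuse the resulting DataProcessor across readers.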
-- C


> On Nov 7, 2019, at 1:35 PM, Paul Rogers <pa...@yahoo.com.INVALID> wrote:
> 
> Hi All,
> 
> One thought to add is that if DFDL defines the file schema, then it would be ideal to use that schema at plan time as well as run time. Drill's Calcite integration provides the means to do this, though I am personally a bit hazy on the details.
> 
> Certainly getting the reader to work is the first step; thanks Charles for the excellent summary. Then, add the needed Calcite integration to make the schema available to the planner at plan time.
> 
> Thanks,
> - Paul
> 


Re: Use cases for DFDL

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi All,

One thought to add is that if DFDL defines the file schema, then it would be ideal to use that schema at plan time as well as run time. Drill's Calcite integration provides the means to do this, though I am personally a bit hazy on the details.

Certainly getting the reader to work is the first step; thanks Charles for the excellent summary. Then, add the needed Calcite integration to make the schema available to the planner at plan time.

Thanks,
- Paul

 


Re: Use cases for DFDL

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi All,

One thought to add is that if DFDL defines the file schema, then it would be ideal to use that schema at plan time as well as run time. Drill's Calcite integration provides means to do this, though I am personally a bit hazy on the details.

Certainly getting the reader to work is the first step; thanks Charles for the excellent summary. Then, add the needed Calcite integration to make the schema available to the planner at plan time.

Thanks,
- Paul

 

    On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre <cg...@gmail.com> wrote:  
 
 Hi Steve, 
Thanks for responding... Here's how Drill reads a file:

Drill uses what are called "format plugins" which basically read the file in question and map fields to column vectors.  Note:  Drill supports nested data structures, so a column could contain a MAP or LIST. 

The basic steps are:
1.  Open the inputstream and read the file
2.  If the schema is known, it is advantageous to define the schema using a schemaBuilder object in advance and create schemaWriters for each column.  In this case, since we'd be using DFDL, we do know the schema so we could create the schema BEFORE the data actually gets read.  If the schema is not known in advance, JSON for instance, Drill can discover the schema as it is reading the data, by dynamically adding column vectors as data is ingested, but that's not the case here... 
3.  Once the schema is defined, Drill will then read the file row by row, parse the data, and assign values to each column vector. 

There are a few more details but that's the essence.  

What would be great is if we could create a function that could directly map a DFDL schema directly to a Drill SchemaBuilder. (Docs here [1])  Drill does natively support JSON, however, it would probably be more effective and efficient if there was an InfosetOutputter custom for Drill.  Ideally, we need some sort of Iterable object so that Drill can map the parsed fields to the schema.  

If you want to take a look at a relatively simple format plugin take a look here: [2]. This file is the BatchReader which is where most of the heavy lifting takes place.  This plugin is for ESRI Shape files and has a mix of pre-defined fields, nested fields and fields that are defined after reading starts.


[1]: https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md <https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md>
[2]: https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java <https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java>


I can start a draft PR on the Drill side over the weekend and will share the link to this list.
Respectfully, 
-- C


> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <st...@gmail.com> wrote:
> 
> I definitely agree. Apache Drill seems like a logical place to add
> Daffodil support. And I'm sure many of us, including myself, would be
> happy to provide some time towards this effort.
> 
> The Daffodil API is actually fairly simple and is usually fairly
> straightforward to integrate--most of the complexity comes from the DFDL
> schemas. There's a good "hello world" available [1] that shows more API
> functionality/errors/etc., but the jist of it is:
> 
> 1) Compile a DFDL schema to a data processor:
> 
>  Compiler c = Daffodil.compiler();
>  ProcessorFactory pf = c.compileFile(file);
>  DataProcessor dp = pf.onPath("/");
> 
> 2) Create an input source for the data
> 
>  InputStream is = ...
>  InputSourceDataInputStream in = new InputSourceDataInputStream(is);
> 
> 3) Create an infoset outputter (we have a handful of differnt kinds)
> 
>  JDOMInfosetOutputter out = new JDOMInfosetOutputter();
> 
> 4) Use the DataProcessor to parse the input data to the infoset outputter
> 
>  ParseResult pr = dataProcessor.parse(in, out)
> 
> So I guess the parts that we would need more Drill understanding is what
> the InfosetOutputter (step 3) needs to look like to better integrate
> into Drill. Is there a standard data structure that Drill expects
> representations of data to look like and Drill does the querying on the
> data structure? And is there some sort of schema that Daffodil would
> need to create to describe what this structure looks like so it could
> query it? Perhaps we'd have a custom Drill InfosetOutputter that create
> this data structure, unless Drill already supports XML or JSON.
> 
> Or is it completely up to the Storage Plugin (is that the right term) to
> determine how to take a Drill query and find the appropriate data from
> the data store?
> 
> - Steve
> 
> [1]
> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java
> 
> 
> On 11/3/19 9:31 AM, Charles Givre wrote:
>> Hi Julian,
>> It seems like there is a beginning of convergence of the minds here.  I went to 
>> the Apache Roadshow in DC and that was where I learned about DFDL and 
>> immediately thought this was a really interesting possibility.
>> 
>> I'd love to see if we could foster some collaboration between the various 
>> projects on this.  From the Drill side of things, it would make it SO much 
>> easier to get Drill to read (and by extension query) various data types.  I'd be 
>> willing to contribute time from the Drill side, but I definitely will need help 
>> understanding how DFDL works.
>> 
>> --C
>> 
>> 
>> 
>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer <j.feinauer@pragmaticminds.de 
>>> <ma...@pragmaticminds.de>> wrote:
>>> 
>>> Hi Charles,
>>> this is an interesting idea and in fact we also discussed the same matter for 
>>> Calcite at ApacheCon NA.
>>> But, I agree that it would be really powerful together with a complete Runtime 
>>> like Drill.
>>> Julian
>>> *Von:*Charles Givre <cgivre@gmail.com <ma...@gmail.com>>
>>> *Antworten an:*"users@daffodil.apache.org <ma...@daffodil.apache.org>" 
>>> <users@daffodil.apache.org <ma...@daffodil.apache.org>>
>>> *Datum:*Mittwoch, 30. Oktober 2019 um 19:38
>>> *An:*"Costello, Roger L." <costello@mitre.org <ma...@mitre.org>>
>>> *Cc:*"users@daffodil.apache.org <ma...@daffodil.apache.org>" 
>>> <users@daffodil.apache.org <ma...@daffodil.apache.org>>
>>> *Betreff:*Re: Use cases for DFDL
>>> +1
>>> 
>>> 
>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <costello@mitre.org 
>>>> <ma...@mitre.org>> wrote:
>>>> Excellent! Okay, here’s the use case:
>>>> A Daffodil extension could be created for Apache Drill so that you could 
>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could 
>>>> use ANSI SQL to query the data, join it with other data, do analysis, etc., 
>>>> just as if it came from a database. So, instead of parsing data to XML and 
>>>> then using XPath to pull out data, you could instead parse data to Apache 
>>>> Drill's data representation and then use ANSI SQL to pull out data, and even 
>>>> combine it with other non-Daffodil data types. The advantage for this would 
>>>> be that it would make it very easy to enable Drill to query new data types 
>>>> (IE simply by using a DFDL schema) and it would enable users to easily query 
>>>> this data without having to load it into another system.
>>>> How’s that Charles?
>>>> /Roger
>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com>>
>>>> *Sent:*Wednesday, October 30, 2019 2:28 PM
>>>> *To:*Costello, Roger L. <costello@mitre.org <ma...@mitre.org>>
>>>> *Cc:*users@daffodil.apache.org <ma...@daffodil.apache.org>
>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax. It is 
>>>> regular ANSI SQL.  IMHO, I think this. would be a really great collaboration 
>>>> of the two communities.
>>>> --C
>>>> 
>>>> 
>>>> 
>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <costello@mitre.org 
>>>>> <ma...@mitre.org>> wrote:
>>>>> Thanks again Charles. Is the following use case description correct?
>>>>> A Daffodil extension could be created for Apache Drill so that you could 
>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could 
>>>>> use Apache Drill's query-like syntax and rich capabilities to query parts of 
>>>>> that data, join it with other data, do analysis, etc., just as if it came 
>>>>> from a database. So, instead of parsing data to XML and then using XPath to 
>>>>> pull out data, you could instead parse data to Apache Drill's data 
>>>>> representation and then use Drills rich data-query capabilities to pull out 
>>>>> data, and even combine it with other non-Daffodil data types. The advantage 
>>>>> for this would be that it would make it very easy to enable Drill to query 
>>>>> new data types (IE simply by using a DFDL schema) and it would enable users 
>>>>> to easily query this data without having to load it into another system.
>>>>> Is that correct?
>>>>> /Roger
>>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com>>
>>>>> *Sent:*Wednesday, October 30, 2019 12:19 PM
>>>>> *To:*Costello, Roger L. <costello@mitre.org <ma...@mitre.org>>
>>>>> *Cc:*users@daffodil.apache.org <ma...@daffodil.apache.org>
>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>> Not exactly...
>>>>> I was thinking of using DFDL to enable Drill to create a schema for data 
>>>>> that Drill cannot read.  If DFDL can be used to describe the schema, a 
>>>>> plugin could be written for Drill that mirrors this schema and ultimately 
>>>>> reads the data files.  Drill wouldn't be populating any database, but rather 
>>>>> directly querying the data.
>>>>> The advantage for this would be that it would make it very easy to enable 
>>>>> Drill to query new data types (i.e., simply by using a DFDL schema) and it 
>>>>> would enable users to easily query this data w/o having to load it into 
>>>>> another system.  Does that make sense?
>>>>> -- C
>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <costello@mitre.org 
>>>>>> <ma...@mitre.org>> wrote:
>>>>>> Thanks Charles. Let me see if I understand the use case correctly.
>>>>>> Use DFDL to parse data to populate a database and then use Apache Drill to 
>>>>>> query the database.
>>>>>> Is that correct?
>>>>>> /Roger
>>>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com>>
>>>>>> *Sent:*Wednesday, October 30, 2019 12:01 PM
>>>>>> *To:*users@daffodil.apache.org <ma...@daffodil.apache.org>
>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill.  I think a 
>>>>>> compelling use case for DFDL would be enabling Drill to query data based on 
>>>>>> a DFDL schema.  This same concept could be applied to other SQL query 
>>>>>> engines such as Presto and/or Impala.
>>>>>> IMHO, this would facilitate the analysis of data sets supported by DFDL.
>>>>>> -- C
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <costello@mitre.org 
>>>>>>> <ma...@mitre.org>> wrote:
>>>>>>> Thanks Mike! I updated the slide:
>>>>>>> <image002.png>
>>>>>>> *From:*Beckerle, Mike <mbeckerle@tresys.com <ma...@tresys.com>>
>>>>>>> *Sent:*Wednesday, October 30, 2019 11:45 AM
>>>>>>> *To:*users@daffodil.apache.org <ma...@daffodil.apache.org>
>>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>>>> I would not pick on RDF data stores as the target.
>>>>>>> Parsing data to populate a database (any variety) is the actual case. The 
>>>>>>> fact that we did do one project involving RDF is why I cited that example 
>>>>>>> in particular, but pulling data into any data store or database begins with 
>>>>>>> the ability to parse the data and then process it into a suitable form.
>>>>>>> This is an incomplete list, so perhaps this slide title should be "Example 
>>>>>>> Use Cases for DFDL"?
>>>>>>> ...mikeb
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> *From:*Costello, Roger L. <costello@mitre.org <ma...@mitre.org>>
>>>>>>> *Sent:*Monday, October 28, 2019 10:41 AM
>>>>>>> *To:*users@daffodil.apache.org 
>>>>>>> <ma...@daffodil.apache.org><users@daffodil.apache.org 
>>>>>>> <ma...@daffodil.apache.org>>
>>>>>>> *Subject:*Use cases for DFDL
>>>>>>> Hi Folks,
>>>>>>> I created a slide of use cases. See below. Do you agree with the slide? 
>>>>>>> Anything you would add, delete, or change?  /Roger
>>>>>>> <image003.png>
>> 
> 
  

Re: Use cases for DFDL

Posted by Paul Rogers <pa...@yahoo.com>.
Hi All,

One thought to add is that if DFDL defines the file schema, then it would be ideal to use that schema at plan time as well as run time. Drill's Calcite integration provides means to do this, though I am personally a bit hazy on the details.

Certainly getting the reader to work is the first step; thanks Charles for the excellent summary. Then, add the needed Calcite integration to make the schema available to the planner at plan time.

Thanks,
- Paul

 

    On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre <cg...@gmail.com> wrote:  
 
 Hi Steve, 
Thanks for responding... Here's how Drill reads a file:

Drill uses what are called "format plugins" which basically read the file in question and map fields to column vectors.  Note:  Drill supports nested data structures, so a column could contain a MAP or LIST. 

The basic steps are:
1.  Open the inputstream and read the file
2.  If the schema is known, it is advantageous to define the schema using a schemaBuilder object in advance and create schemaWriters for each column.  In this case, since we'd be using DFDL, we do know the schema so we could create the schema BEFORE the data actually gets read.  If the schema is not known in advance, JSON for instance, Drill can discover the schema as it is reading the data, by dynamically adding column vectors as data is ingested, but that's not the case here... 
3.  Once the schema is defined, Drill will then read the file row by row, parse the data, and assign values to each column vector. 

There are a few more details but that's the essence.  

What would be great is if we could create a function that could directly map a DFDL schema directly to a Drill SchemaBuilder. (Docs here [1])  Drill does natively support JSON, however, it would probably be more effective and efficient if there was an InfosetOutputter custom for Drill.  Ideally, we need some sort of Iterable object so that Drill can map the parsed fields to the schema.  
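A minimal sketch of what such a mapping function might look like. The Drill side uses the SchemaBuilder/TupleMetadata classes from the row set framework linked at [1]; the DFDL side is purely illustrative (I'm not aware of a public Daffodil API that walks the compiled schema this way), so the DfdlElement description and the element list passed in are assumptions standing in for whatever mechanism eventually extracts this information from the schema:

  import org.apache.drill.common.types.TypeProtos.MinorType;
  import org.apache.drill.exec.record.metadata.SchemaBuilder;
  import org.apache.drill.exec.record.metadata.TupleMetadata;

  public class DfdlSchemaMapper {

    // Hypothetical, flattened view of one DFDL element; Daffodil does not
    // hand this out directly, so assume it was extracted from the schema.
    static class DfdlElement {
      String name;
      String xsdType;    // e.g. "xs:int", "xs:string"
      boolean repeating; // maxOccurs > 1
    }

    // Build the Drill schema up front, before any data is read (step 2 above).
    // Nested complex elements would map to addMap()/addMapArray() instead.
    static TupleMetadata toDrillSchema(Iterable<DfdlElement> elements) {
      SchemaBuilder builder = new SchemaBuilder();
      for (DfdlElement e : elements) {
        MinorType type = drillTypeFor(e.xsdType);
        if (e.repeating) {
          builder.addArray(e.name, type);
        } else {
          builder.addNullable(e.name, type);
        }
      }
      return builder.buildSchema();
    }

    // Very rough XSD-simple-type to Drill-type mapping; dates, decimals,
    // and binary types would need additional cases.
    static MinorType drillTypeFor(String xsdType) {
      switch (xsdType) {
        case "xs:int":
        case "xs:integer": return MinorType.INT;
        case "xs:long":    return MinorType.BIGINT;
        case "xs:double":  return MinorType.FLOAT8;
        case "xs:boolean": return MinorType.BIT;
        default:           return MinorType.VARCHAR;
      }
    }
  }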

If you want to take a look at a relatively simple format plugin take a look here: [2]. This file is the BatchReader which is where most of the heavy lifting takes place.  This plugin is for ESRI Shape files and has a mix of pre-defined fields, nested fields and fields that are defined after reading starts.


[1]: https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md <https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md>
[2]: https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java <https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java>


I can start a draft PR on the Drill side over the weekend and will share the link to this list.
Respectfully, 
-- C


> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <st...@gmail.com> wrote:
> 
> I definitely agree. Apache Drill seems like a logical place to add
> Daffodil support. And I'm sure many of us, including myself, would be
> happy to provide some time towards this effort.
> 
> The Daffodil API is actually fairly simple and is usually fairly
> straightforward to integrate--most of the complexity comes from the DFDL
> schemas. There's a good "hello world" available [1] that shows more API
> functionality/errors/etc., but the jist of it is:
> 
> 1) Compile a DFDL schema to a data processor:
> 
>  Compiler c = Daffodil.compiler();
>  ProcessorFactory pf = c.compileFile(file);
>  DataProcessor dp = pf.onPath("/");
> 
> 2) Create an input source for the data
> 
>  InputStream is = ...
>  InputSourceDataInputStream in = new InputSourceDataInputStream(is);
> 
> 3) Create an infoset outputter (we have a handful of differnt kinds)
> 
>  JDOMInfosetOutputter out = new JDOMInfosetOutputter();
> 
> 4) Use the DataProcessor to parse the input data to the infoset outputter
> 
>  ParseResult pr = dp.parse(in, out);
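
A rough sketch of the error check that would typically follow the parse call, modeled loosely on the hello world example linked at [1] below (treat the diagnostic accessors as assumptions and confirm them against that example):

  // pr and out are the ParseResult and JDOMInfosetOutputter from the steps above
  if (pr.isError()) {
    // report why the data did not match the DFDL schema
    for (Diagnostic d : pr.getDiagnostics()) {    // org.apache.daffodil.japi.Diagnostic
      System.err.println(d.getMessage());         // getMessage() assumed; see [1]
    }
  } else {
    org.jdom2.Document infoset = out.getResult(); // parsed data as a JDOM document
  }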
> 
> So I guess the part where we would need more Drill understanding is what
> the InfosetOutputter (step 3) needs to look like to better integrate
> into Drill. Is there a standard data structure that Drill expects data
> to be represented in, with Drill doing the querying on that data
> structure? And is there some sort of schema that Daffodil would need to
> create to describe what this structure looks like so Drill could query
> it? Perhaps we'd have a custom Drill InfosetOutputter that creates this
> data structure, unless Drill already supports XML or JSON.
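
One possible interim answer, sketched under the assumption that a Drill format plugin keeps the existing JDOMInfosetOutputter and copies the resulting tree into Drill's row writers; a purpose-built Drill InfosetOutputter could later skip the intermediate JDOM step. The RowSetLoader/ScalarWriter calls come from Drill's row set framework (package names follow recent Drill master and may differ by version), and the flat, record-per-root-child layout is assumed purely for illustration:

  import org.apache.daffodil.japi.infoset.JDOMInfosetOutputter;
  import org.apache.drill.exec.physical.resultSet.RowSetLoader;
  import org.jdom2.Document;
  import org.jdom2.Element;

  public class DaffodilRowCopier {

    // Copy one parsed record (one child of the infoset root) into Drill.
    // Assumes a flat record whose children are simple values with columns
    // already defined in the schema handed to Drill.
    static void copyRecord(Element record, RowSetLoader rowWriter) {
      rowWriter.start();
      for (Element field : record.getChildren()) {
        // setString is the simplest bridge; typed setters (setInt, setDouble, ...)
        // would take over once a DFDL-to-Drill type mapping is in place.
        rowWriter.scalar(field.getName()).setString(field.getText());
      }
      rowWriter.save();
    }

    // Walk every record element under the infoset root.
    static void copyAll(JDOMInfosetOutputter out, RowSetLoader rowWriter) {
      Document infoset = out.getResult();
      for (Element record : infoset.getRootElement().getChildren()) {
        copyRecord(record, rowWriter);
      }
    }
  }

Whether "one record per child of the root" is the right granularity depends entirely on the DFDL schema in question, which is exactly the kind of detail a custom InfosetOutputter (or the schema mapping discussed above) would have to settle.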
> 
> Or is it completely up to the Storage Plugin (is that the right term?) to
> determine how to take a Drill query and find the appropriate data from
> the data store?
> 
> - Steve
> 
> [1]
> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java
> 
> 
> On 11/3/19 9:31 AM, Charles Givre wrote:
>> Hi Julian,
>> It seems like there is the beginning of a convergence of minds here.  I went to 
>> the Apache Roadshow in DC and that was where I learned about DFDL and 
>> immediately thought this was a really interesting possibility.
>> 
>> I'd love to see if we could foster some collaboration between the various 
>> projects on this.  From the Drill side of things, it would make it SO much 
>> easier to get Drill to read (and by extension query) various data types.  I'd be 
>> willing to contribute time from the Drill side, but I definitely will need help 
>> understanding how DFDL works.
>> 
>> --C
>> 
>> 
>> 
>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer <j.feinauer@pragmaticminds.de 
>>> <ma...@pragmaticminds.de>> wrote:
>>> 
>>> Hi Charles,
>>> this is an interesting idea and in fact we also discussed the same matter for 
>>> Calcite at ApacheCon NA.
>>> But I agree that it would be really powerful together with a complete runtime 
>>> like Drill.
>>> Julian
>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com>>
>>> *Reply-To:*"users@daffodil.apache.org <ma...@daffodil.apache.org>" 
>>> <users@daffodil.apache.org <ma...@daffodil.apache.org>>
>>> *Date:*Wednesday, October 30, 2019 at 19:38
>>> *To:*"Costello, Roger L." <costello@mitre.org <ma...@mitre.org>>
>>> *Cc:*"users@daffodil.apache.org <ma...@daffodil.apache.org>" 
>>> <users@daffodil.apache.org <ma...@daffodil.apache.org>>
>>> *Subject:*Re: Use cases for DFDL
>>> +1
>>> 
>>> 
>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <costello@mitre.org 
>>>> <ma...@mitre.org>> wrote:
>>>> Excellent! Okay, here’s the use case:
>>>> A Daffodil extension could be created for Apache Drill so that you could 
>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could 
>>>> use ANSI SQL to query the data, join it with other data, do analysis, etc., 
>>>> just as if it came from a database. So, instead of parsing data to XML and 
>>>> then using XPath to pull out data, you could instead parse data to Apache 
>>>> Drill's data representation and then use ANSI SQL to pull out data, and even 
>>>> combine it with other non-Daffodil data types. The advantage for this would 
>>>> be that it would make it very easy to enable Drill to query new data types 
>>>> (i.e., simply by using a DFDL schema) and it would enable users to easily query 
>>>> this data without having to load it into another system.
>>>> How’s that Charles?
>>>> /Roger
>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com>>
>>>> *Sent:*Wednesday, October 30, 2019 2:28 PM
>>>> *To:*Costello, Roger L. <costello@mitre.org <ma...@mitre.org>>
>>>> *Cc:*users@daffodil.apache.org <ma...@daffodil.apache.org>
>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax. It is 
>>>> regular ANSI SQL.  IMHO, I think this would be a really great collaboration 
>>>> of the two communities.
>>>> --C
>>>> 
>>>> 
>>>> 
>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <costello@mitre.org 
>>>>> <ma...@mitre.org>> wrote:
>>>>> Thanks again Charles. Is the following use case description correct?
>>>>> A Daffodil extension could be created for Apache Drill so that you could 
>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could 
>>>>> use Apache Drill's query-like syntax and rich capabilities to query parts of 
>>>>> that data, join it with other data, do analysis, etc., just as if it came 
>>>>> from a database. So, instead of parsing data to XML and then using XPath to 
>>>>> pull out data, you could instead parse data to Apache Drill's data 
>>>>> representation and then use Drill's rich data-query capabilities to pull out 
>>>>> data, and even combine it with other non-Daffodil data types. The advantage 
>>>>> for this would be that it would make it very easy to enable Drill to query 
>>>>> new data types (i.e., simply by using a DFDL schema) and it would enable users 
>>>>> to easily query this data without having to load it into another system.
>>>>> Is that correct?
>>>>> /Roger
>>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com>>
>>>>> *Sent:*Wednesday, October 30, 2019 12:19 PM
>>>>> *To:*Costello, Roger L. <costello@mitre.org <ma...@mitre.org>>
>>>>> *Cc:*users@daffodil.apache.org <ma...@daffodil.apache.org>
>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>> Not exactly...
>>>>> I was thinking of using DFDL to enable Drill to create a schema for data 
>>>>> that Drill cannot read.  If DFDL can be used to describe the schema, a 
>>>>> plugin could be written for Drill that mirrors this schema and ultimately 
>>>>> reads the data files.  Drill wouldn't be populating any database, but rather 
>>>>> directly querying the data.
>>>>> The advantage for this would be that it would make it very easy to enable 
>>>>> Drill to query new data types (i.e., simply by using a DFDL schema) and it 
>>>>> would enable users to easily query this data w/o having to load it into 
>>>>> another system.  Does that make sense?
>>>>> -- C
>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <costello@mitre.org 
>>>>>> <ma...@mitre.org>> wrote:
>>>>>> Thanks Charles. Let me see if I understand the use case correctly.
>>>>>> Use DFDL to parse data to populate a database and then use Apache Drill to 
>>>>>> query the database.
>>>>>> Is that correct?
>>>>>> /Roger
>>>>>> *From:*Charles Givre <cgivre@gmail.com <ma...@gmail.com>>
>>>>>> *Sent:*Wednesday, October 30, 2019 12:01 PM
>>>>>> *To:*users@daffodil.apache.org <ma...@daffodil.apache.org>
>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill.  I think a 
>>>>>> compelling use case for DFDL would be enabling Drill to query data based on 
>>>>>> a DFDL schema.  This same concept could be applied to other SQL query 
>>>>>> engines such as Presto and/or Impala.
>>>>>> IMHO, this would facilitate the analysis of data sets supported by DFDL.
>>>>>> -- C
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <costello@mitre.org 
>>>>>>> <ma...@mitre.org>> wrote:
>>>>>>> Thanks Mike! I updated the slide:
>>>>>>> <image002.png>
>>>>>>> *From:*Beckerle, Mike <mbeckerle@tresys.com <ma...@tresys.com>>
>>>>>>> *Sent:*Wednesday, October 30, 2019 11:45 AM
>>>>>>> *To:*users@daffodil.apache.org <ma...@daffodil.apache.org>
>>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>>>> I would not pick on RDF data stores as the target.
>>>>>>> Parsing data to populate a database (any variety) is the actual case. The 
>>>>>>> fact that we did do one project involving RDF is why I cited that example 
>>>>>>> in particular, but pulling data into any data store or database begins with 
>>>>>>> the ability to parse the data and then process it into a suitable form.
>>>>>>> This is an incomplete list, so perhaps this slide title should be "Example 
>>>>>>> Use Cases for DFDL"?
>>>>>>> ...mikeb
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> *From:*Costello, Roger L. <costello@mitre.org <ma...@mitre.org>>
>>>>>>> *Sent:*Monday, October 28, 2019 10:41 AM
>>>>>>> *To:*users@daffodil.apache.org 
>>>>>>> <ma...@daffodil.apache.org><users@daffodil.apache.org 
>>>>>>> <ma...@daffodil.apache.org>>
>>>>>>> *Subject:*Use cases for DFDL
>>>>>>> Hi Folks,
>>>>>>> I created a slide of use cases. See below. Do you agree with the slide? 
>>>>>>> Anything you would add, delete, or change?  /Roger
>>>>>>> <image003.png>
>> 
>