Posted to user@drill.apache.org by Ted Dunning <te...@gmail.com> on 2019/04/02 20:06:12 UTC

Re: [DISCUSS]: Additional Formats for Drill

I have no idea how much uptake these would have, but if the library can
give all the formats all at once for modest effort, that would be great.

On Tue, Apr 2, 2019 at 9:22 AM Charles Givre <cg...@gmail.com> wrote:

> Hello everyone,
> I recently presented a talk at the ASF DC Roadshow (shameless plug [1])
> and heard a really good talk by a PMC member for the Apache Daffodil
> (incubating) project.  At its core, Daffodil is a collection of parsers
> which convert various data formats to a standard structure which can then
> be ingested into other tools.  Drill can already ingest some of these
> formats natively, such as PCAP and CSV; many others it cannot, such as
> NACHA (bulk financial transactions), vCard, and Shapefile.  Here is a
> brief presentation about Daffodil [2].
>
> The DFDLSchemas GitHub has a handful of DFDL schemas that are pretty good
> open source examples [3].
>
> On a related note, I stumbled on the Kaitai Struct library [4], which
> performs a similar function to Daffodil.  Would it be of interest for the
> community to incorporate these libraries into Drill?  My thought is that
> it would greatly increase the types of data that Drill can natively query
> and hence seriously increase Drill’s usefulness.  If there is interest
> (and honestly even if there isn’t), I can start working on this for the
> next release of Drill.
>
>
> [1]: https://www.slideshare.net/cgivre/drilling-cyber-security-data-with-apache-drill
> [2]: https://www.slideshare.net/mbeckerle/tresys-dfdl-data-format-description-language-daffodil-open-source-public-overview-100432615
> [3]: https://github.com/DFDLSchemas
> [4]: http://formats.kaitai.io
>
>

Re: [DISCUSS]: Additional Formats for Drill

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

Hi All,

Daffodil is an interesting project as is the DFDLSchemas project. Thanks for sharing!

An interesting challenge is how these libraries load data: what is their internal format, or what API do they offer for the application to consume data? I found this for Daffodil: it will "parse data into an infoset represented as XML or JSON".

Drill is part of the "big data" ecosystem. Converting, say, a 100 GB file into XML and then loading that into Drill would be cumbersome. It would be better if the libraries provided an API that Drill could implement to receive the data and write it to vectors using, say, the new row set framework that we've just added for CSV and will soon add for JSON. Parsers for both JSON and XML offer this kind of API, in which the application supplies the implementation that receives the parsed data. Drill uses this approach to parse JSON.
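
To make the idea of an API that the application implements more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `RecordHandler`, `ColumnWriter`, and the toy `parse_key_value_lines` parser are invented for illustration and are not the Daffodil, Kaitai Struct, or Drill row set APIs. The point is only that the parser pushes events to a handler, and the handler writes straight into column storage without an intermediate XML or JSON document.

```python
from typing import Dict, Iterable, List, Protocol


class RecordHandler(Protocol):
    """Hypothetical callback interface a parsing library could expose."""

    def start_record(self) -> None: ...
    def field(self, name: str, value: str) -> None: ...
    def end_record(self) -> None: ...


class ColumnWriter:
    """Toy stand-in for a vector writer: one Python list per column."""

    def __init__(self) -> None:
        self.columns: Dict[str, List[object]] = {}
        self.row_count = 0
        self._current: Dict[str, str] = {}

    def start_record(self) -> None:
        self._current = {}

    def field(self, name: str, value: str) -> None:
        self._current[name] = value

    def end_record(self) -> None:
        # Back-fill any column first seen in this record with nulls.
        for name in self._current:
            self.columns.setdefault(name, [None] * self.row_count)
        # Append this record's value (or a null) to every known column.
        for name, values in self.columns.items():
            values.append(self._current.get(name))
        self.row_count += 1


def parse_key_value_lines(lines: Iterable[str], handler: RecordHandler) -> None:
    """Toy parser for lines like 'k=v;k=v'. It pushes events, builds no tree."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        handler.start_record()
        for pair in line.split(";"):
            name, _, value = pair.partition("=")
            handler.field(name, value)
        handler.end_record()


writer = ColumnWriter()
parse_key_value_lines(["id=1;name=ann", "id=2;city=oslo"], writer)
```

After the call, `writer.columns` holds one list per field name, with `None` where a record lacked that field. A real integration would replace `ColumnWriter` with whatever writer the row set framework provides.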

Another issue is file splits: when a large file is stored on HDFS (yes, HDFS is old, everyone uses S3 now), we want Drill to read each file block separately. This means the file must be "splittable": there must be some well-defined token that the scanner can search for at block boundaries. It is not clear whether these parsers are designed for this big data model.
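
As an illustration of what "splittable" asks of a format, here is a small Python sketch of a split reader. It assumes the well-defined token is a newline, which is true for line-oriented text but may not hold for the binary formats these libraries describe; `read_split` is a hypothetical helper, not Drill or Hadoop code. Each reader skips forward to the first token after its start offset (unless it starts at byte 0) and keeps reading until it has passed its end offset, so every record is read by exactly one split.

```python
import io
from typing import BinaryIO, Iterator


def read_split(stream: BinaryIO, start: int, end: int) -> Iterator[bytes]:
    """Yield the newline-delimited records owned by the byte range [start, end)."""
    stream.seek(start)
    if start != 0:
        # The reader of the previous split reads through this line,
        # so discard everything up to and including the first newline.
        stream.readline()
    # Read any record that begins at or before `end`; the last one may
    # run past the boundary into the next block.
    while stream.tell() <= end:
        line = stream.readline()
        if not line:
            break  # end of file
        yield line.rstrip(b"\n")


data = b"alpha\nbravo\ncharlie\ndelta\n"
splits = [(0, 10), (10, 20), (20, len(data))]
records = [list(read_split(io.BytesIO(data), s, e)) for s, e in splits]
```

Here the second split starts in the middle of "bravo", skips the rest of that line, and picks up at "charlie". A format with no such token cannot be resynchronized this way, and would have to be read by a single scanner from the start of the file.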

For both projects, it would be good to read data into Arrow. Ideally, we'd get a volunteer to port the row set mechanism to Arrow so that the same API can write to both Arrow and Drill vectors (saving the entire world from having to write their own vector writing mechanisms).

Thanks,
- Paul

