Posted to dev@drill.apache.org by Edmon Begoli <eb...@gmail.com> on 2015/09/13 22:42:55 UTC

Update on EDI support for Drill - repo and design collaboratory

Ted, Matt, et al.,

I have created a temporary repository for the design and development of
support for the EDI format in Drill.
At this point, it is not a fork of Drill, but rather a collaboration space
and code repository for exploratory code.

Wiki:
https://github.com/ebegoli/edi-drill-store/wiki

Repo:
https://github.com/ebegoli/edi-drill-store

Once the difficult parts specific to EDI (logical nesting, record
representation) are figured out and generic code is written for I/O and
translation, I will look to merge this with Drill and blend it into
Drill-specific patterns.
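
To make "logical nesting" concrete, here is a minimal sketch (not code from
the repository) of grouping flat EDI segments into a nested record. It
assumes an X12-style message with `~` as the segment terminator and `*` as
the element separator, and a much simplified purchase-order layout. The
function name `parse_edi`, the output field names, and the sample message are
all hypothetical.

```python
from typing import Any

# Assumed X12-style delimiters; real interchanges declare theirs in the ISA header.
SEGMENT_TERMINATOR = "~"
ELEMENT_SEPARATOR = "*"


def parse_edi(message: str) -> dict[str, Any]:
    """Group flat segments into header / items / footer (hypothetical layout)."""
    record: dict[str, Any] = {"header": [], "items": [], "footer": []}
    section = "header"
    for raw in message.split(SEGMENT_TERMINATOR):
        raw = raw.strip()
        if not raw:
            continue
        seg_id, *elements = raw.split(ELEMENT_SEPARATOR)
        if seg_id == "PO1":
            # Each PO1 segment opens a new line item.
            section = "items"
            record["items"].append({"id": seg_id, "elements": elements, "children": []})
        elif seg_id in ("CTT", "SE"):
            # Summary and trailer segments close the item loop.
            section = "footer"
            record["footer"].append({"id": seg_id, "elements": elements})
        elif section == "items":
            # Any other segment inside the loop belongs to the current item.
            record["items"][-1]["children"].append({"id": seg_id, "elements": elements})
        else:
            record[section].append({"id": seg_id, "elements": elements})
    return record


# Made-up sample message for illustration only.
sample = (
    "ST*850*0001~BEG*00*SA*PO123~PO1*1*10*EA*9.95~PID*F****Widget~"
    "PO1*2*5*EA*4.50~CTT*2~SE*7*0001~"
)
parsed = parse_edi(sample)
```

The point of the sketch is only that the nesting is implied by segment order
rather than by explicit brackets, which is what makes the record
representation hard to settle.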

*If you wish, I will add you to the repo so you can edit the wiki.*

Let me know please.

Edmon


On Sun, Sep 6, 2015 at 7:16 AM, Edmon Begoli <eb...@gmail.com> wrote:

> Matt - that is fantastic. Having good, liberally licensed format
> converters probably takes care of 50% of the problem. The other 50%
> will be in figuring out the logical mapping.
>
> Let me think a little bit and propose how we can best set up a
> collaboration platform. Any suggestions for this are welcome.
>
> I personally like Google stuff, Hangouts, docs, and Github, of course.
>
>
> On Saturday, September 5, 2015, Matthew Burgess <ma...@gmail.com>
> wrote:
>
>> Edmon,
>>
>> All our Data Integration (file-format parsing, e.g.) code is Apache-2.0
>> licensed; we have parsers/processors
>> <https://github.com/pentaho/pentaho-kettle/tree/master/engine/src/org/pentaho/di/trans/steps>
>> for EDI / XML(StaX) / HL7 / YAML, etc. I have a plugin
>> <https://github.com/mattyb149/load-text-from-file-plugin> (also
>> Apache-2.0) using Tika to extract metadata; this could be refactored as a
>> Drill plugin.
>>
>> The (semi-)structured-to-tabular conversion will be an issue that most
>> Drill extenders will have to deal with, although with powerful functions
>> like KVGEN() and FLATTEN() it should be less daunting. For graphs
>> (highly-structured but non-tabular data sources), I'm also looking into a
>> Gremlin <http://tinkerpop.incubator.apache.org/> plugin, which could
>> connect Graph Databases with Drill. Again, the problem is representing
>> non-tabular data in a SQL environment as you mentioned.
>>
>> Regards,
>> Matt
>>
>> From:  Edmon Begoli <eb...@gmail.com>
>> Reply-To:  <de...@drill.apache.org>
>> Date:  Saturday, September 5, 2015 at 8:46 PM
>> To:  <de...@drill.apache.org>
>> Subject:  Re: Data representation and conversation - translating nested
>> hierarchies into a tabular/queriable format
>>
>> Matt - any contribution of your time is welcome! Thank you.
>>
>> These problems that we are wanting to look into are not easy problems; I
>> would not expect quick solutions, but any good idea, contribution of time,
>> or code will help us advance the state of the capabilities.
>>
>> I might create a branch or separate Github repo, so that we just use its
>> wiki for documentation and collaboration, and then later for scratch pad
>> development.
>>
>> Regarding existing tools you might have - *do you think you could bring
>> this code under the Apache 2 license?*
>> Knowing what you told me before, I think that contributing this code would
>> help advance the state of Drill's format support tremendously.
>>
>> I see two major challenges related to what I am proposing:
>>
>> 1. (greater challenge) How to bring heterogeneously structured data
>> logically and semantically into the tabular orientation of a typical SQL
>> query processing engine.
>> I think that some problems will not be completely implementable, so we'll
>> need to either approximate or make some limiting/bounding design choices.
>>
>> 2. How to support these new formats through the Drill API. This is more of
>> an API study, design, and programming effort. Nothing contradictory.
>>
>> Edmon
>>
>>
>>
>>
>> On Sat, Sep 5, 2015 at 8:12 PM, Matt Burgess <ma...@gmail.com> wrote:
>>
>> >  Challenge accepted! :) are we talking about things like XML, Jsonnet,
>> >  Yaml, etc.? And/or binary file formats that are (semi-)structured in
>> >  nature like XLSX?
>> >
>> >  If we want to go more unstructured we could look at Apache Tika to at
>> >  least pull out metadata on things like image and video files, and I'm
>> >  tinkering with the idea of a UDF called topics() for human-generated
>> >  text using Apache OpenNLP, the problem being a well-trained model for
>> >  the target data.
>> >
>> >  Edmon, I admire your ambition and would like to help out where/when I
>> >  can. Having said that, so far my amount of available time for Drill has
>> >  been embarrassingly lower than my amount of interest.
>> >
>> >  For well-known file formats, I may be able to help with some of our
>> >  open-source tools for parsing such files.
>> >
>> >  Regards,
>> >  Matt
>> >
>> >  Sent from my iPhone
>> >
>> >>  > On Sep 5, 2015, at 7:44 PM, Edmon Begoli <eb...@gmail.com> wrote:
>> >>  >
>> >>  > Anyone else from the Drill team wholeheartedly invited.
>> >>  >
>> >>  > Edmon
>> >>  >
>> >>>  >> On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <eb...@gmail.com> wrote:
>> >>>  >>
>> >>>  >> Let's do it, Ted. I think it would add tremendous value to Drill as a
>> >>>  >> solution.
>> >>>  >>
>> >>>  >> I will start a Google doc and share with you so we can share ideas,
>> >>>  >> have Hangouts, design, etc. until we have something solid to put into
>> >>>  >> Drill proper.
>> >>>  >>
>> >>>  >> If you have any other suggestion for the mode of collaboration please
>> >>>  >> let me know.
>> >>>  >>
>> >>>>  >>> On Saturday, September 5, 2015, Ted Dunning <ted.dunning@gmail.com> wrote:
>> >>>>  >>>
>> >>>>>  >>>> On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli <ebegoli@gmail.com> wrote:
>> >>>>>  >>>>
>> >>>>>  >>>> *My question - has this been handled already in Drill and storage
>> >>>>>  >>>> formats?*
>> >>>>>  >>>>
>> >>>>>  >>>> If so, where?
>> >>>>>  >>>>
>> >>>>>  >>>> If not, what is your recommendation for handling this?
>> >>>>>  >>>>
>> >>>>>  >>>> Should it be in an independent library outside of Drill that presents
>> >>>>>  >>>> a flattened version (not sure if this is possible), or maybe break the
>> >>>>>  >>>> message into tables corresponding to header data, items, footer?
>> >>>>  >>>
>> >>>>  >>> Drill does handle these kinds of data well, but currently the only file
>> >>>>  >>> formats that it can consume for this kind of data are JSON and Parquet.
>> >>>>  >>>
>> >>>>  >>> It would be great to have more.  I would love to work on this with you.
>> >>>  >>
>> >
>>
>>
>>
>>

Re: Update on EDI support for Drill - repo and design collaboratory

Posted by Ted Dunning <te...@gmail.com>.
Take a look at the JSON input format plugin.  That can't be cloned outside
of Drill at this point because it involves access to some internals, but it
should provide some guidance about how to read complex objects.
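
One way to experiment before touching Drill internals, sketched here under
assumptions rather than taken from the thread: write the nested records out
as JSON, since JSON is a format Drill can already consume, and see how the
nesting behaves in queries. The sketch assumes that one JSON object per line
is an acceptable file layout. The record layout and the helper names
`to_json_lines` and `flatten_items` are hypothetical; `flatten_items` only
imitates, in plain Python, the one-row-per-array-element idea behind
FLATTEN() and is not Drill code.

```python
import json
from typing import Any, Iterable, Iterator


def to_json_lines(records: Iterable[dict[str, Any]]) -> str:
    """Serialize each nested record as one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def flatten_items(record: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield one row per element of record["items"], repeating the other fields."""
    parent = {k: v for k, v in record.items() if k != "items"}
    for item in record.get("items", []):
        yield {**parent, "item": item}


# Hypothetical purchase order with a repeated "items" group.
order = {
    "po_number": "PO123",
    "items": [
        {"line": 1, "qty": 10, "unit_price": 9.95},
        {"line": 2, "qty": 5, "unit_price": 4.50},
    ],
}

json_text = to_json_lines([order])
rows = list(flatten_items(order))
```

Keeping the repeated group as an array in the JSON, rather than flattening at
write time, leaves the choice between nested and tabular views to the query.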



On Sun, Sep 13, 2015 at 8:42 PM, Edmon Begoli <eb...@gmail.com> wrote:

> I understand. I hope you and the rest will help me with design guidance as
> I start translating EDI format into a Drill-amenable one.

Re: Update on EDI support for Drill - repo and design collaboratory

Posted by Edmon Begoli <eb...@gmail.com>.
I understand. I hope you and the rest will help me with design guidance as
I start translating EDI format into a Drill-amenable one.

On Sunday, September 13, 2015, Ted Dunning <te...@gmail.com> wrote:

> I doubt that I will be able to produce significant amounts of code. If I do
> produce much of anything, I would be happy to contribute via pull requests.
>
> So I don't need to be on the repo as a contributor.

Re: Update on EDI support for Drill - repo and design collaboratory

Posted by Ted Dunning <te...@gmail.com>.
I doubt that I will be able to produce significant amounts of code. If I do
produce much of anything, I would be happy to contribute via pull requests.

So I don't need to be on the repo as a contributor.
