You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2016/05/25 15:05:24 UTC

Is there a good way to handle bad date data?

I have some DOBs, and some fields are empty others apparently were filled
by trained monkeys, but while most data is accurate, some data is not.

As you saw from my other post, I am trying to get the age for those DOBs
that are valid...

My function works, until I get to a record that is not valid and I get
something like this:

Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear
must be in the range [1,12]


Is there a good "Try -> Except" type solution that will grant me the valid
data if things worked, and just return 0 or whatever I specify if there is
an error?

I could try casting the data, but if it fails won't it kill my query?
Basically I want it to keep going if it fails... not sure if Drill has this
ability, but thought I would ask.

Re: Is there a good way to handle bad date data?

Posted by Charles Givre <cg...@gmail.com>.
I would have a big +1 to a community repo of Drill UDFs.  
John, 
How do you think the best way to host that would be?
— C

> On May 25, 2016, at 12:55, John Omernik <jo...@omernik.com> wrote:
> 
> Cool, I wasn't aware of SIMILAR to (I learned something)  However, that
> doesn't work because my data is accurate i.e. '____-__-__' 2015-04-02 and
> 2015-00-23  but 00 doesn't work (bad data) .
> 
> UDFs scare me in that the only Java I've conquered is evident from my empty
> french press...
> 
> I know I've brought it up in the past, but has anyone seen any community
> around UDFs start? I'd love to have a community that follows Apache like
> rules, and allows us to create and track UDFs to share... that would be
> pretty neat.   I guess if we were to do something like that, should one of
> us (I can volunteer) just start a Github project and encourage folks to
> come to the table or is there better way via Apache to do something like
> that?
> 
> On Wed, May 25, 2016 at 10:27 AM, Veera Naranammalpuram <
> vnaranammalpuram@maprtech.com> wrote:
> 
>> You could write a UDF. Or you could do something like this:
>> 
>> cat data.csv
>> 05/25/2016
>> 20160525
>> May 25th 2016
>> 
>> 0: jdbc:drill:> select case when columns[0] similar to '__/__/____' then
>> to_date(columns[0],'MM/dd/yyyy') when columns[0] similar to '________' then
>> to_date(columns[0],'yyyyMMdd') else NULL end from `data.csv`;
>> +-------------+
>> |   EXPR$0    |
>> +-------------+
>> | 2016-05-25  |
>> | 2016-05-25  |
>> | null        |
>> +-------------+
>> 3 rows selected (0.4 seconds)
>> 0: jdbc:drill:>
>> 
>> -Veera
>> 
>> On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <vg...@mapr.com>
>> wrote:
>> 
>>> Sounds like a job for a UDF?
>>> 
>>> You could do the try/catch inside the UDF.
>>> 
>>> ----
>>> Vince Gonzalez
>>> Systems Engineer
>>> 212.694.3879
>>> 
>>> mapr.com
>>> 
>>> On Wed, May 25, 2016 at 11:05 AM, John Omernik <jo...@omernik.com> wrote:
>>> 
>>>> I have some DOBs, and some fields are empty others apparently were
>> filled
>>>> by trained monkeys, but while most data is accurate, some data is not.
>>>> 
>>>> As you saw from my other post, I am trying to get the age for those
>> DOBs
>>>> that are valid...
>>>> 
>>>> My function works, until I get to a record that is not valid and I get
>>>> something like this:
>>>> 
>>>> Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for
>> monthOfYear
>>>> must be in the range [1,12]
>>>> 
>>>> 
>>>> Is there a good "Try -> Except" type solution that will grant me the
>>> valid
>>>> data if things worked, and just return 0 or whatever I specify if there
>>> is
>>>> an error?
>>>> 
>>>> I could try casting the data, but if it fails won't it kill my query?
>>>> Basically I want it to keep going if it fails... not sure if Drill has
>>> this
>>>> ability, but thought I would ask.
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Veera Naranammalpuram
>> Product Specialist - SQL on Hadoop
>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
>> *(Email) vnaranammalpuram@maprtech.com <na...@maprtech.com>*
>> *(Mobile) 917 683 8116 - can text *
>> *Timezone: ET (UTC -5:00 / -4:00)*
>> 


Re: Is there a good way to handle bad date data?

Posted by Charles Givre <cg...@gmail.com>.
Sorry…
Let me restate in that I’m willing to assist with that… ;-)



> On May 25, 2016, at 12:55, John Omernik <jo...@omernik.com> wrote:
> 
> Cool, I wasn't aware of SIMILAR to (I learned something)  However, that
> doesn't work because my data is accurate i.e. '____-__-__' 2015-04-02 and
> 2015-00-23  but 00 doesn't work (bad data) .
> 
> UDFs scare me in that the only Java I've conquered is evident from my empty
> french press...
> 
> I know I've brought it up in the past, but has anyone seen any community
> around UDFs start? I'd love to have a community that follows Apache like
> rules, and allows us to create and track UDFs to share... that would be
> pretty neat.   I guess if we were to do something like that, should one of
> us (I can volunteer) just start a Github project and encourage folks to
> come to the table or is there better way via Apache to do something like
> that?
> 
> On Wed, May 25, 2016 at 10:27 AM, Veera Naranammalpuram <
> vnaranammalpuram@maprtech.com> wrote:
> 
>> You could write a UDF. Or you could do something like this:
>> 
>> cat data.csv
>> 05/25/2016
>> 20160525
>> May 25th 2016
>> 
>> 0: jdbc:drill:> select case when columns[0] similar to '__/__/____' then
>> to_date(columns[0],'MM/dd/yyyy') when columns[0] similar to '________' then
>> to_date(columns[0],'yyyyMMdd') else NULL end from `data.csv`;
>> +-------------+
>> |   EXPR$0    |
>> +-------------+
>> | 2016-05-25  |
>> | 2016-05-25  |
>> | null        |
>> +-------------+
>> 3 rows selected (0.4 seconds)
>> 0: jdbc:drill:>
>> 
>> -Veera
>> 
>> On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <vg...@mapr.com>
>> wrote:
>> 
>>> Sounds like a job for a UDF?
>>> 
>>> You could do the try/catch inside the UDF.
>>> 
>>> ----
>>> Vince Gonzalez
>>> Systems Engineer
>>> 212.694.3879
>>> 
>>> mapr.com
>>> 
>>> On Wed, May 25, 2016 at 11:05 AM, John Omernik <jo...@omernik.com> wrote:
>>> 
>>>> I have some DOBs, and some fields are empty others apparently were
>> filled
>>>> by trained monkeys, but while most data is accurate, some data is not.
>>>> 
>>>> As you saw from my other post, I am trying to get the age for those
>> DOBs
>>>> that are valid...
>>>> 
>>>> My function works, until I get to a record that is not valid and I get
>>>> something like this:
>>>> 
>>>> Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for
>> monthOfYear
>>>> must be in the range [1,12]
>>>> 
>>>> 
>>>> Is there a good "Try -> Except" type solution that will grant me the
>>> valid
>>>> data if things worked, and just return 0 or whatever I specify if there
>>> is
>>>> an error?
>>>> 
>>>> I could try casting the data, but if it fails won't it kill my query?
>>>> Basically I want it to keep going if it fails... not sure if Drill has
>>> this
>>>> ability, but thought I would ask.
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Veera Naranammalpuram
>> Product Specialist - SQL on Hadoop
>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
>> *(Email) vnaranammalpuram@maprtech.com <na...@maprtech.com>*
>> *(Mobile) 917 683 8116 - can text *
>> *Timezone: ET (UTC -5:00 / -4:00)*
>> 


Re: Is there a good way to handle bad date data?

Posted by John Omernik <jo...@omernik.com>.
Ya, I've found that Repo and there are some others out there. Having a
place for them vs. having a community with some rules/guidelines are
different beasts all together. If I am over thinking this though, and the
community aspect is overkill for something like this, that's valid and
honest feedback.  I guess, I am looking for opinions on something like
this, I'd be happy to champion something like this, but want to see if
there some, at least initial interest around it.

John

On Wed, May 25, 2016 at 7:00 PM, Vince Gonzalez <vg...@mapr.com> wrote:

> John,
>
> Some MapR folk have collected some UDFs here:
> https://github.com/mapr-demos/simple-drill-functions
>
> Additions via pull request are welcome, I'm sure.
>
> On Wednesday, May 25, 2016, John Omernik <jo...@omernik.com> wrote:
>
> > So Charles here's how I'd set it up (I am not tied to this as I would
> love
> > to have others from Drill community feel like it's an open community ala
> > Apache, however, I am not sure the best way to approach)
> >
> > So, up to me, and this is just spitballing
> >
> > 1. Create a Github repo (I'd use my account just because, but if it makes
> > sense under the apache one I am not tied to it)
> > 2. Create a Readme that  describes what we have
> >    - I think UDFs should be grouped into folders under the repo, thing of
> > these as "groups of UDFs"  This is a human based grouping that makes it
> > easy to organize by some general types, say string, language processing.
> > Not sure the best way to approach this, but I want to make it a little
> bit
> > grouped, rather than flat to make it easy.
> >   - Each UDF would have it's own folder.
> >   - We could create a basic requirements for UDFs to be accepted, perhaps
> > certain tests, a README.md, LICENSE (we'd need people to submit to the
> > apache license) package.info, (explained below).  Readme would have
> > certain
> > data about how to use etc.
> >   - package.info  Here would be a file that has a json record that has
> > name, description, how to use, and tags.  It's kinda like the grouping by
> > folders, but it's used from an install perspective and from a package
> > management perspective.  See below)
> > 3. We won't keep jars in the repo, only source. But we will include a
> > docker file that will be as small as possible, and this will be used to
> > build on demand and UDF that someone wants to install.  Thus we can
> ensure
> > the UDFs build well on anyone system AND that people who want to use UDFs
> > don't have to be Java Experts
> > 4.  The package manager could have settings, like your Drill install
> > directory, and basically, it would build and install any UDF you want
> Now,
> > to keep things simple, the package manager can use the tags on the UDFs
> to
> > determine which udfs to build and then deploy, so you could build install
> > UDFs by tag (so you can say build all with tag X so you don't have to
> > individual ones) or you could build by name.
> > 5. The package manager would have list and search features that would use
> > the description, name and tags to help you search through the packages,
> and
> > provide a list of packages.  This could be a "pre" step prior to
> installing
> > allowing you to search, and only install whats needed based on what you
> > want.
> > 6. We can remove packages based on the install dir.
> > 7. How to we handle across nodes? Shared locations are great, or we could
> > create "install packages" i.e. after build we can bundle all jars into
> tgz
> > that can be deployed etc.
> >
> > Shrug, perhaps it's a bit verbose, but the idea here is we want to
> > encourage people to submit here, we want issues to be tracked, and we
> want
> > to have one place to send folks.
> >
> > I would still like to use the drill user list for discussion (at first)
> but
> > if it the UDF discussion grows to be to much noise, we'd need a new list.
> > All UDFs would have to be Apache Licensed, and like I said, maybe we
> prove
> > this out with the idea that we can get this moved to Apache. I am not
> sure,
> > does Apache do "related projects"  I.e. this on its own may not be an
> > Apache project, but to keep it within the Apache guidelines would be
> really
> > cool.
> >
> > So, that's a lot of stuff, but I am trying to toss out ideas more for
> > critique/discussion.
> >
> > So please, critique/discuss :)
> >
> > John
> >
> >
> >
> >
> >
> > On Wed, May 25, 2016 at 12:11 PM, MattK <matt@hybriddba.com
> <javascript:;>>
> > wrote:
> >
> > > UDFs scare me in that the only Java I've conquered is evident from my
> > empty
> > >> french press...
> > >>
> > >
> > > Same issue here. I have solved this in other platforms by
> pre-processing
> > > the data with a set of regex replacements in Awk:
> > >
> > > ~~~
> > > # "Repair" invalid dates as stored in MySQL (3 replacements for
> > > readability no slower than one nested)
> > > $0 = gensub(/0000-([0-9]{2}-[0-9]{2})/, "0001-\\1", "g", $0)
> > > $0 = gensub(/([0-9]{4})-00-([0-9]{2})/, "\\1-01-\\2", "g", $0)
> > > $0 = gensub(/([0-9]{4}-[0-9]{2})-00/, "\\1-01", "g", $0)
> > > ~~~
> > >
> > > But of course this adds another step in the pipeline. Perhaps something
> > > similar to could be implemented via
> > > https://drill.apache.org/docs/string-manipulation/#regexp_replace ?
> > >
> > >
> > >
> > >
> > > On 25 May 2016, at 12:55, John Omernik wrote:
> > >
> > > Cool, I wasn't aware of SIMILAR to (I learned something)  However, that
> > >> doesn't work because my data is accurate i.e. '____-__-__' 2015-04-02
> > and
> > >> 2015-00-23  but 00 doesn't work (bad data) .
> > >>
> > >> UDFs scare me in that the only Java I've conquered is evident from my
> > >> empty
> > >> french press...
> > >>
> > >> I know I've brought it up in the past, but has anyone seen any
> community
> > >> around UDFs start? I'd love to have a community that follows Apache
> like
> > >> rules, and allows us to create and track UDFs to share... that would
> be
> > >> pretty neat.   I guess if we were to do something like that, should
> one
> > of
> > >> us (I can volunteer) just start a Github project and encourage folks
> to
> > >> come to the table or is there better way via Apache to do something
> like
> > >> that?
> > >>
> > >> On Wed, May 25, 2016 at 10:27 AM, Veera Naranammalpuram <
> > >> vnaranammalpuram@maprtech.com <javascript:;>> wrote:
> > >>
> > >> You could write a UDF. Or you could do something like this:
> > >>>
> > >>> cat data.csv
> > >>> 05/25/2016
> > >>> 20160525
> > >>> May 25th 2016
> > >>>
> > >>> 0: jdbc:drill:> select case when columns[0] similar to '__/__/____'
> > then
> > >>> to_date(columns[0],'MM/dd/yyyy') when columns[0] similar to
> '________'
> > >>> then
> > >>> to_date(columns[0],'yyyyMMdd') else NULL end from `data.csv`;
> > >>> +-------------+
> > >>> |   EXPR$0    |
> > >>> +-------------+
> > >>> | 2016-05-25  |
> > >>> | 2016-05-25  |
> > >>> | null        |
> > >>> +-------------+
> > >>> 3 rows selected (0.4 seconds)
> > >>> 0: jdbc:drill:>
> > >>>
> > >>> -Veera
> > >>>
> > >>> On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <vgonzalez@mapr.com
> > <javascript:;>>
> > >>> wrote:
> > >>>
> > >>> Sounds like a job for a UDF?
> > >>>>
> > >>>> You could do the try/catch inside the UDF.
> > >>>>
> > >>>>  ----
> > >>>>  Vince Gonzalez
> > >>>>  Systems Engineer
> > >>>>  212.694.3879
> > >>>>
> > >>>>  mapr.com
> > >>>>
> > >>>> On Wed, May 25, 2016 at 11:05 AM, John Omernik <john@omernik.com
> > <javascript:;>>
> > >>>> wrote:
> > >>>>
> > >>>> I have some DOBs, and some fields are empty others apparently were
> > >>>>>
> > >>>> filled
> > >>>
> > >>>> by trained monkeys, but while most data is accurate, some data is
> not.
> > >>>>>
> > >>>>> As you saw from my other post, I am trying to get the age for those
> > >>>>>
> > >>>> DOBs
> > >>>
> > >>>> that are valid...
> > >>>>>
> > >>>>> My function works, until I get to a record that is not valid and I
> > get
> > >>>>> something like this:
> > >>>>>
> > >>>>> Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for
> > >>>>>
> > >>>> monthOfYear
> > >>>
> > >>>> must be in the range [1,12]
> > >>>>>
> > >>>>>
> > >>>>> Is there a good "Try -> Except" type solution that will grant me
> the
> > >>>>>
> > >>>> valid
> > >>>>
> > >>>>> data if things worked, and just return 0 or whatever I specify if
> > there
> > >>>>>
> > >>>> is
> > >>>>
> > >>>>> an error?
> > >>>>>
> > >>>>> I could try casting the data, but if it fails won't it kill my
> query?
> > >>>>> Basically I want it to keep going if it fails... not sure if Drill
> > has
> > >>>>>
> > >>>> this
> > >>>>
> > >>>>> ability, but thought I would ask.
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Veera Naranammalpuram
> > >>> Product Specialist - SQL on Hadoop
> > >>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> > >>> *(Email) vnaranammalpuram@maprtech.com <javascript:;> <
> > naranammalpuram@maprtech.com <javascript:;>>*
> > >>> *(Mobile) 917 683 8116 - can text *
> > >>> *Timezone: ET (UTC -5:00 / -4:00)*
> > >>>
> > >>>
> >
>
>
> --
>  ----
>  Vince Gonzalez
>  Systems Engineer
>  212.694.3879
>
>  mapr.com
>

Re: Is there a good way to handle bad date data?

Posted by Vince Gonzalez <vg...@mapr.com>.
John,

Some MapR folk have collected some UDFs here:
https://github.com/mapr-demos/simple-drill-functions

Additions via pull request are welcome, I'm sure.

On Wednesday, May 25, 2016, John Omernik <jo...@omernik.com> wrote:

> So Charles here's how I'd set it up (I am not tied to this as I would love
> to have others from Drill community feel like it's an open community ala
> Apache, however, I am not sure the best way to approach)
>
> So, up to me, and this is just spitballing
>
> 1. Create a Github repo (I'd use my account just because, but if it makes
> sense under the apache one I am not tied to it)
> 2. Create a Readme that  describes what we have
>    - I think UDFs should be grouped into folders under the repo, thing of
> these as "groups of UDFs"  This is a human based grouping that makes it
> easy to organize by some general types, say string, language processing.
> Not sure the best way to approach this, but I want to make it a little bit
> grouped, rather than flat to make it easy.
>   - Each UDF would have it's own folder.
>   - We could create a basic requirements for UDFs to be accepted, perhaps
> certain tests, a README.md, LICENSE (we'd need people to submit to the
> apache license) package.info, (explained below).  Readme would have
> certain
> data about how to use etc.
>   - package.info  Here would be a file that has a json record that has
> name, description, how to use, and tags.  It's kinda like the grouping by
> folders, but it's used from an install perspective and from a package
> management perspective.  See below)
> 3. We won't keep jars in the repo, only source. But we will include a
> docker file that will be as small as possible, and this will be used to
> build on demand and UDF that someone wants to install.  Thus we can ensure
> the UDFs build well on anyone system AND that people who want to use UDFs
> don't have to be Java Experts
> 4.  The package manager could have settings, like your Drill install
> directory, and basically, it would build and install any UDF you want  Now,
> to keep things simple, the package manager can use the tags on the UDFs to
> determine which udfs to build and then deploy, so you could build install
> UDFs by tag (so you can say build all with tag X so you don't have to
> individual ones) or you could build by name.
> 5. The package manager would have list and search features that would use
> the description, name and tags to help you search through the packages, and
> provide a list of packages.  This could be a "pre" step prior to installing
> allowing you to search, and only install whats needed based on what you
> want.
> 6. We can remove packages based on the install dir.
> 7. How to we handle across nodes? Shared locations are great, or we could
> create "install packages" i.e. after build we can bundle all jars into tgz
> that can be deployed etc.
>
> Shrug, perhaps it's a bit verbose, but the idea here is we want to
> encourage people to submit here, we want issues to be tracked, and we want
> to have one place to send folks.
>
> I would still like to use the drill user list for discussion (at first) but
> if it the UDF discussion grows to be to much noise, we'd need a new list.
> All UDFs would have to be Apache Licensed, and like I said, maybe we prove
> this out with the idea that we can get this moved to Apache. I am not sure,
> does Apache do "related projects"  I.e. this on its own may not be an
> Apache project, but to keep it within the Apache guidelines would be really
> cool.
>
> So, that's a lot of stuff, but I am trying to toss out ideas more for
> critique/discussion.
>
> So please, critique/discuss :)
>
> John
>
>
>
>
>
> On Wed, May 25, 2016 at 12:11 PM, MattK <matt@hybriddba.com <javascript:;>>
> wrote:
>
> > UDFs scare me in that the only Java I've conquered is evident from my
> empty
> >> french press...
> >>
> >
> > Same issue here. I have solved this in other platforms by pre-processing
> > the data with a set of regex replacements in Awk:
> >
> > ~~~
> > # "Repair" invalid dates as stored in MySQL (3 replacements for
> > readability no slower than one nested)
> > $0 = gensub(/0000-([0-9]{2}-[0-9]{2})/, "0001-\\1", "g", $0)
> > $0 = gensub(/([0-9]{4})-00-([0-9]{2})/, "\\1-01-\\2", "g", $0)
> > $0 = gensub(/([0-9]{4}-[0-9]{2})-00/, "\\1-01", "g", $0)
> > ~~~
> >
> > But of course this adds another step in the pipeline. Perhaps something
> > similar to could be implemented via
> > https://drill.apache.org/docs/string-manipulation/#regexp_replace ?
> >
> >
> >
> >
> > On 25 May 2016, at 12:55, John Omernik wrote:
> >
> > Cool, I wasn't aware of SIMILAR to (I learned something)  However, that
> >> doesn't work because my data is accurate i.e. '____-__-__' 2015-04-02
> and
> >> 2015-00-23  but 00 doesn't work (bad data) .
> >>
> >> UDFs scare me in that the only Java I've conquered is evident from my
> >> empty
> >> french press...
> >>
> >> I know I've brought it up in the past, but has anyone seen any community
> >> around UDFs start? I'd love to have a community that follows Apache like
> >> rules, and allows us to create and track UDFs to share... that would be
> >> pretty neat.   I guess if we were to do something like that, should one
> of
> >> us (I can volunteer) just start a Github project and encourage folks to
> >> come to the table or is there better way via Apache to do something like
> >> that?
> >>
> >> On Wed, May 25, 2016 at 10:27 AM, Veera Naranammalpuram <
> >> vnaranammalpuram@maprtech.com <javascript:;>> wrote:
> >>
> >> You could write a UDF. Or you could do something like this:
> >>>
> >>> cat data.csv
> >>> 05/25/2016
> >>> 20160525
> >>> May 25th 2016
> >>>
> >>> 0: jdbc:drill:> select case when columns[0] similar to '__/__/____'
> then
> >>> to_date(columns[0],'MM/dd/yyyy') when columns[0] similar to '________'
> >>> then
> >>> to_date(columns[0],'yyyyMMdd') else NULL end from `data.csv`;
> >>> +-------------+
> >>> |   EXPR$0    |
> >>> +-------------+
> >>> | 2016-05-25  |
> >>> | 2016-05-25  |
> >>> | null        |
> >>> +-------------+
> >>> 3 rows selected (0.4 seconds)
> >>> 0: jdbc:drill:>
> >>>
> >>> -Veera
> >>>
> >>> On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <vgonzalez@mapr.com
> <javascript:;>>
> >>> wrote:
> >>>
> >>> Sounds like a job for a UDF?
> >>>>
> >>>> You could do the try/catch inside the UDF.
> >>>>
> >>>>  ----
> >>>>  Vince Gonzalez
> >>>>  Systems Engineer
> >>>>  212.694.3879
> >>>>
> >>>>  mapr.com
> >>>>
> >>>> On Wed, May 25, 2016 at 11:05 AM, John Omernik <john@omernik.com
> <javascript:;>>
> >>>> wrote:
> >>>>
> >>>> I have some DOBs, and some fields are empty others apparently were
> >>>>>
> >>>> filled
> >>>
> >>>> by trained monkeys, but while most data is accurate, some data is not.
> >>>>>
> >>>>> As you saw from my other post, I am trying to get the age for those
> >>>>>
> >>>> DOBs
> >>>
> >>>> that are valid...
> >>>>>
> >>>>> My function works, until I get to a record that is not valid and I
> get
> >>>>> something like this:
> >>>>>
> >>>>> Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for
> >>>>>
> >>>> monthOfYear
> >>>
> >>>> must be in the range [1,12]
> >>>>>
> >>>>>
> >>>>> Is there a good "Try -> Except" type solution that will grant me the
> >>>>>
> >>>> valid
> >>>>
> >>>>> data if things worked, and just return 0 or whatever I specify if
> there
> >>>>>
> >>>> is
> >>>>
> >>>>> an error?
> >>>>>
> >>>>> I could try casting the data, but if it fails won't it kill my query?
> >>>>> Basically I want it to keep going if it fails... not sure if Drill
> has
> >>>>>
> >>>> this
> >>>>
> >>>>> ability, but thought I would ask.
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Veera Naranammalpuram
> >>> Product Specialist - SQL on Hadoop
> >>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> >>> *(Email) vnaranammalpuram@maprtech.com <javascript:;> <
> naranammalpuram@maprtech.com <javascript:;>>*
> >>> *(Mobile) 917 683 8116 - can text *
> >>> *Timezone: ET (UTC -5:00 / -4:00)*
> >>>
> >>>
>


-- 
 ----
 Vince Gonzalez
 Systems Engineer
 212.694.3879

 mapr.com

Re: Is there a good way to handle bad date data?

Posted by John Omernik <jo...@omernik.com>.
So Charles here's how I'd set it up (I am not tied to this as I would love
to have others from Drill community feel like it's an open community ala
Apache, however, I am not sure the best way to approach)

So, up to me, and this is just spitballing

1. Create a Github repo (I'd use my account just because, but if it makes
sense under the apache one I am not tied to it)
2. Create a Readme that  describes what we have
   - I think UDFs should be grouped into folders under the repo, thing of
these as "groups of UDFs"  This is a human based grouping that makes it
easy to organize by some general types, say string, language processing.
Not sure the best way to approach this, but I want to make it a little bit
grouped, rather than flat to make it easy.
  - Each UDF would have it's own folder.
  - We could create a basic requirements for UDFs to be accepted, perhaps
certain tests, a README.md, LICENSE (we'd need people to submit to the
apache license) package.info, (explained below).  Readme would have certain
data about how to use etc.
  - package.info  Here would be a file that has a json record that has
name, description, how to use, and tags.  It's kinda like the grouping by
folders, but it's used from an install perspective and from a package
management perspective.  See below)
3. We won't keep jars in the repo, only source. But we will include a
docker file that will be as small as possible, and this will be used to
build on demand and UDF that someone wants to install.  Thus we can ensure
the UDFs build well on anyone system AND that people who want to use UDFs
don't have to be Java Experts
4.  The package manager could have settings, like your Drill install
directory, and basically, it would build and install any UDF you want  Now,
to keep things simple, the package manager can use the tags on the UDFs to
determine which udfs to build and then deploy, so you could build install
UDFs by tag (so you can say build all with tag X so you don't have to
individual ones) or you could build by name.
5. The package manager would have list and search features that would use
the description, name and tags to help you search through the packages, and
provide a list of packages.  This could be a "pre" step prior to installing
allowing you to search, and only install whats needed based on what you
want.
6. We can remove packages based on the install dir.
7. How to we handle across nodes? Shared locations are great, or we could
create "install packages" i.e. after build we can bundle all jars into tgz
that can be deployed etc.

Shrug, perhaps it's a bit verbose, but the idea here is we want to
encourage people to submit here, we want issues to be tracked, and we want
to have one place to send folks.

I would still like to use the drill user list for discussion (at first) but
if it the UDF discussion grows to be to much noise, we'd need a new list.
All UDFs would have to be Apache Licensed, and like I said, maybe we prove
this out with the idea that we can get this moved to Apache. I am not sure,
does Apache do "related projects"  I.e. this on its own may not be an
Apache project, but to keep it within the Apache guidelines would be really
cool.

So, that's a lot of stuff, but I am trying to toss out ideas more for
critique/discussion.

So please, critique/discuss :)

John





On Wed, May 25, 2016 at 12:11 PM, MattK <ma...@hybriddba.com> wrote:

> UDFs scare me in that the only Java I've conquered is evident from my empty
>> french press...
>>
>
> Same issue here. I have solved this in other platforms by pre-processing
> the data with a set of regex replacements in Awk:
>
> ~~~
> # "Repair" invalid dates as stored in MySQL (3 replacements for
> readability no slower than one nested)
> $0 = gensub(/0000-([0-9]{2}-[0-9]{2})/, "0001-\\1", "g", $0)
> $0 = gensub(/([0-9]{4})-00-([0-9]{2})/, "\\1-01-\\2", "g", $0)
> $0 = gensub(/([0-9]{4}-[0-9]{2})-00/, "\\1-01", "g", $0)
> ~~~
>
> But of course this adds another step in the pipeline. Perhaps something
> similar to could be implemented via
> https://drill.apache.org/docs/string-manipulation/#regexp_replace ?
>
>
>
>
> On 25 May 2016, at 12:55, John Omernik wrote:
>
> Cool, I wasn't aware of SIMILAR to (I learned something)  However, that
>> doesn't work because my data is accurate i.e. '____-__-__' 2015-04-02 and
>> 2015-00-23  but 00 doesn't work (bad data) .
>>
>> UDFs scare me in that the only Java I've conquered is evident from my
>> empty
>> french press...
>>
>> I know I've brought it up in the past, but has anyone seen any community
>> around UDFs start? I'd love to have a community that follows Apache like
>> rules, and allows us to create and track UDFs to share... that would be
>> pretty neat.   I guess if we were to do something like that, should one of
>> us (I can volunteer) just start a Github project and encourage folks to
>> come to the table or is there better way via Apache to do something like
>> that?
>>
>> On Wed, May 25, 2016 at 10:27 AM, Veera Naranammalpuram <
>> vnaranammalpuram@maprtech.com> wrote:
>>
>> You could write a UDF. Or you could do something like this:
>>>
>>> cat data.csv
>>> 05/25/2016
>>> 20160525
>>> May 25th 2016
>>>
>>> 0: jdbc:drill:> select case when columns[0] similar to '__/__/____' then
>>> to_date(columns[0],'MM/dd/yyyy') when columns[0] similar to '________'
>>> then
>>> to_date(columns[0],'yyyyMMdd') else NULL end from `data.csv`;
>>> +-------------+
>>> |   EXPR$0    |
>>> +-------------+
>>> | 2016-05-25  |
>>> | 2016-05-25  |
>>> | null        |
>>> +-------------+
>>> 3 rows selected (0.4 seconds)
>>> 0: jdbc:drill:>
>>>
>>> -Veera
>>>
>>> On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <vg...@mapr.com>
>>> wrote:
>>>
>>> Sounds like a job for a UDF?
>>>>
>>>> You could do the try/catch inside the UDF.
>>>>
>>>>  ----
>>>>  Vince Gonzalez
>>>>  Systems Engineer
>>>>  212.694.3879
>>>>
>>>>  mapr.com
>>>>
>>>> On Wed, May 25, 2016 at 11:05 AM, John Omernik <jo...@omernik.com>
>>>> wrote:
>>>>
>>>> I have some DOBs, and some fields are empty others apparently were
>>>>>
>>>> filled
>>>
>>>> by trained monkeys, but while most data is accurate, some data is not.
>>>>>
>>>>> As you saw from my other post, I am trying to get the age for those
>>>>>
>>>> DOBs
>>>
>>>> that are valid...
>>>>>
>>>>> My function works, until I get to a record that is not valid and I get
>>>>> something like this:
>>>>>
>>>>> Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for
>>>>>
>>>> monthOfYear
>>>
>>>> must be in the range [1,12]
>>>>>
>>>>>
>>>>> Is there a good "Try -> Except" type solution that will grant me the
>>>>>
>>>> valid
>>>>
>>>>> data if things worked, and just return 0 or whatever I specify if there
>>>>>
>>>> is
>>>>
>>>>> an error?
>>>>>
>>>>> I could try casting the data, but if it fails won't it kill my query?
>>>>> Basically I want it to keep going if it fails... not sure if Drill has
>>>>>
>>>> this
>>>>
>>>>> ability, but thought I would ask.
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Veera Naranammalpuram
>>> Product Specialist - SQL on Hadoop
>>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
>>> *(Email) vnaranammalpuram@maprtech.com <na...@maprtech.com>*
>>> *(Mobile) 917 683 8116 - can text *
>>> *Timezone: ET (UTC -5:00 / -4:00)*
>>>
>>>

Re: Is there a good way to handle bad date data?

Posted by MattK <ma...@hybriddba.com>.
> UDFs scare me in that the only Java I've conquered is evident from my 
> empty
> french press...

Same issue here. I have solved this in other platforms by pre-processing 
the data with a set of regex replacements in Awk:

~~~
# "Repair" invalid dates as stored in MySQL (3 replacements for 
readability no slower than one nested)
$0 = gensub(/0000-([0-9]{2}-[0-9]{2})/, "0001-\\1", "g", $0)
$0 = gensub(/([0-9]{4})-00-([0-9]{2})/, "\\1-01-\\2", "g", $0)
$0 = gensub(/([0-9]{4}-[0-9]{2})-00/, "\\1-01", "g", $0)
~~~

But of course this adds another step in the pipeline. Perhaps something 
similar to could be implemented via 
https://drill.apache.org/docs/string-manipulation/#regexp_replace ?



On 25 May 2016, at 12:55, John Omernik wrote:

> Cool, I wasn't aware of SIMILAR to (I learned something)  However, 
> that
> doesn't work because my data is accurate i.e. '____-__-__' 2015-04-02 
> and
> 2015-00-23  but 00 doesn't work (bad data) .
>
> UDFs scare me in that the only Java I've conquered is evident from my 
> empty
> french press...
>
> I know I've brought it up in the past, but has anyone seen any 
> community
> around UDFs start? I'd love to have a community that follows Apache 
> like
> rules, and allows us to create and track UDFs to share... that would 
> be
> pretty neat.   I guess if we were to do something like that, should 
> one of
> us (I can volunteer) just start a Github project and encourage folks 
> to
> come to the table or is there better way via Apache to do something 
> like
> that?
>
> On Wed, May 25, 2016 at 10:27 AM, Veera Naranammalpuram <
> vnaranammalpuram@maprtech.com> wrote:
>
>> You could write a UDF. Or you could do something like this:
>>
>> cat data.csv
>> 05/25/2016
>> 20160525
>> May 25th 2016
>>
>> 0: jdbc:drill:> select case when columns[0] similar to '__/__/____' 
>> then
>> to_date(columns[0],'MM/dd/yyyy') when columns[0] similar to 
>> '________' then
>> to_date(columns[0],'yyyyMMdd') else NULL end from `data.csv`;
>> +-------------+
>> |   EXPR$0    |
>> +-------------+
>> | 2016-05-25  |
>> | 2016-05-25  |
>> | null        |
>> +-------------+
>> 3 rows selected (0.4 seconds)
>> 0: jdbc:drill:>
>>
>> -Veera
>>
>> On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <vg...@mapr.com>
>> wrote:
>>
>>> Sounds like a job for a UDF?
>>>
>>> You could do the try/catch inside the UDF.
>>>
>>>  ----
>>>  Vince Gonzalez
>>>  Systems Engineer
>>>  212.694.3879
>>>
>>>  mapr.com
>>>
>>> On Wed, May 25, 2016 at 11:05 AM, John Omernik <jo...@omernik.com> 
>>> wrote:
>>>
>>>> I have some DOBs, and some fields are empty others apparently were
>> filled
>>>> by trained monkeys, but while most data is accurate, some data is 
>>>> not.
>>>>
>>>> As you saw from my other post, I am trying to get the age for those
>> DOBs
>>>> that are valid...
>>>>
>>>> My function works, until I get to a record that is not valid and I 
>>>> get
>>>> something like this:
>>>>
>>>> Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for
>> monthOfYear
>>>> must be in the range [1,12]
>>>>
>>>>
>>>> Is there a good "Try -> Except" type solution that will grant me 
>>>> the
>>> valid
>>>> data if things worked, and just return 0 or whatever I specify if 
>>>> there
>>> is
>>>> an error?
>>>>
>>>> I could try casting the data, but if it fails won't it kill my 
>>>> query?
>>>> Basically I want it to keep going if it fails... not sure if Drill 
>>>> has
>>> this
>>>> ability, but thought I would ask.
>>>>
>>>
>>
>>
>>
>> --
>> Veera Naranammalpuram
>> Product Specialist - SQL on Hadoop
>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
>> *(Email) vnaranammalpuram@maprtech.com 
>> <na...@maprtech.com>*
>> *(Mobile) 917 683 8116 - can text *
>> *Timezone: ET (UTC -5:00 / -4:00)*
>>

Re: Is there a good way to handle bad date data?

Posted by John Omernik <jo...@omernik.com>.
Cool, I wasn't aware of SIMILAR to (I learned something)  However, that
doesn't work because my data is accurate i.e. '____-__-__' 2015-04-02 and
2015-00-23  but 00 doesn't work (bad data) .

UDFs scare me in that the only Java I've conquered is evident from my empty
french press...

I know I've brought it up in the past, but has anyone seen any community
around UDFs start? I'd love to have a community that follows Apache like
rules, and allows us to create and track UDFs to share... that would be
pretty neat.   I guess if we were to do something like that, should one of
us (I can volunteer) just start a Github project and encourage folks to
come to the table or is there better way via Apache to do something like
that?

On Wed, May 25, 2016 at 10:27 AM, Veera Naranammalpuram <
vnaranammalpuram@maprtech.com> wrote:

> You could write a UDF. Or you could do something like this:
>
> cat data.csv
> 05/25/2016
> 20160525
> May 25th 2016
>
> 0: jdbc:drill:> select case when columns[0] similar to '__/__/____' then
> to_date(columns[0],'MM/dd/yyyy') when columns[0] similar to '________' then
> to_date(columns[0],'yyyyMMdd') else NULL end from `data.csv`;
> +-------------+
> |   EXPR$0    |
> +-------------+
> | 2016-05-25  |
> | 2016-05-25  |
> | null        |
> +-------------+
> 3 rows selected (0.4 seconds)
> 0: jdbc:drill:>
>
> -Veera
>
> On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <vg...@mapr.com>
> wrote:
>
> > Sounds like a job for a UDF?
> >
> > You could do the try/catch inside the UDF.
> >
> >  ----
> >  Vince Gonzalez
> >  Systems Engineer
> >  212.694.3879
> >
> >  mapr.com
> >
> > On Wed, May 25, 2016 at 11:05 AM, John Omernik <jo...@omernik.com> wrote:
> >
> > > I have some DOBs, and some fields are empty others apparently were
> filled
> > > by trained monkeys, but while most data is accurate, some data is not.
> > >
> > > As you saw from my other post, I am trying to get the age for those
> DOBs
> > > that are valid...
> > >
> > > My function works, until I get to a record that is not valid and I get
> > > something like this:
> > >
> > > Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for
> monthOfYear
> > > must be in the range [1,12]
> > >
> > >
> > > Is there a good "Try -> Except" type solution that will grant me the
> > valid
> > > data if things worked, and just return 0 or whatever I specify if there
> > is
> > > an error?
> > >
> > > I could try casting the data, but if it fails won't it kill my query?
> > > Basically I want it to keep going if it fails... not sure if Drill has
> > this
> > > ability, but thought I would ask.
> > >
> >
>
>
>
> --
> Veera Naranammalpuram
> Product Specialist - SQL on Hadoop
> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> *(Email) vnaranammalpuram@maprtech.com <na...@maprtech.com>*
> *(Mobile) 917 683 8116 - can text *
> *Timezone: ET (UTC -5:00 / -4:00)*
>

Re: Is there a good way to handle bad date data?

Posted by Veera Naranammalpuram <vn...@maprtech.com>.
You could write a UDF. Or you could do something like this:

cat data.csv
05/25/2016
20160525
May 25th 2016

0: jdbc:drill:> select case when columns[0] similar to '__/__/____' then
to_date(columns[0],'MM/dd/yyyy') when columns[0] similar to '________' then
to_date(columns[0],'yyyyMMdd') else NULL end from `data.csv`;
+-------------+
|   EXPR$0    |
+-------------+
| 2016-05-25  |
| 2016-05-25  |
| null        |
+-------------+
3 rows selected (0.4 seconds)
0: jdbc:drill:>

-Veera

On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <vg...@mapr.com> wrote:

> Sounds like a job for a UDF?
>
> You could do the try/catch inside the UDF.
>
>  ----
>  Vince Gonzalez
>  Systems Engineer
>  212.694.3879
>
>  mapr.com
>
> On Wed, May 25, 2016 at 11:05 AM, John Omernik <jo...@omernik.com> wrote:
>
> > I have some DOBs, and some fields are empty others apparently were filled
> > by trained monkeys, but while most data is accurate, some data is not.
> >
> > As you saw from my other post, I am trying to get the age for those DOBs
> > that are valid...
> >
> > My function works, until I get to a record that is not valid and I get
> > something like this:
> >
> > Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear
> > must be in the range [1,12]
> >
> >
> > Is there a good "Try -> Except" type solution that will grant me the
> valid
> > data if things worked, and just return 0 or whatever I specify if there
> is
> > an error?
> >
> > I could try casting the data, but if it fails won't it kill my query?
> > Basically I want it to keep going if it fails... not sure if Drill has
> this
> > ability, but thought I would ask.
> >
>



-- 
Veera Naranammalpuram
Product Specialist - SQL on Hadoop
*MapR Technologies (www.mapr.com <http://www.mapr.com>)*
*(Email) vnaranammalpuram@maprtech.com <na...@maprtech.com>*
*(Mobile) 917 683 8116 - can text *
*Timezone: ET (UTC -5:00 / -4:00)*

Re: Is there a good way to handle bad date data?

Posted by Vince Gonzalez <vg...@mapr.com>.
Sounds like a job for a UDF?

You could do the try/catch inside the UDF.

 ----
 Vince Gonzalez
 Systems Engineer
 212.694.3879

 mapr.com

On Wed, May 25, 2016 at 11:05 AM, John Omernik <jo...@omernik.com> wrote:

> I have some DOBs, and some fields are empty others apparently were filled
> by trained monkeys, but while most data is accurate, some data is not.
>
> As you saw from my other post, I am trying to get the age for those DOBs
> that are valid...
>
> My function works, until I get to a record that is not valid and I get
> something like this:
>
> Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear
> must be in the range [1,12]
>
>
> Is there a good "Try -> Except" type solution that will grant me the valid
> data if things worked, and just return 0 or whatever I specify if there is
> an error?
>
> I could try casting the data, but if it fails won't it kill my query?
> Basically I want it to keep going if it fails... not sure if Drill has this
> ability, but thought I would ask.
>