Posted to dev@drill.apache.org by Hsuan Yi Chu <hy...@maprtech.com> on 2015/10/20 18:26:52 UTC

[Design Document] Support the Ability to Identify And Skip Records when Function Evaluations Fail

Hi All,
Now seems like the right time to ask for feedback on a design proposal.

We just finished a Design Document for a new feature that allows users to
identify and skip records when function evaluations fail. (Thanks to Aman,
Mehant, and Parth for their contributions to the draft.)

https://docs.google.com/document/d/1jCeYW924_SFwf-nOqtXrO68eixmAitM-tLngezzXw3Y/edit
(This can be found in DRILL-3764
<https://issues.apache.org/jira/browse/DRILL-3764> too)

Your comments and feedback are highly appreciated.

Re: [Design Document] Support the Ability to Identify And Skip Records when Function Evaluations Fail

Posted by Mustafa Engin Sözer <me...@goodgamestudios.com>.
Sorry guys, I accidentally answered the question here; I did not realize it
was a different topic.


Re: [Design Document] Support the Ability to Identify And Skip Records when Function Evaluations Fail

Posted by Mustafa Engin Sözer <me...@goodgamestudios.com>.
Hi,

If you only need to show the data rather than build a full report (e.g. as
in MicroStrategy or Tableau), then you can use most SQL clients, for
instance SQuirreL, HeidiSQL, etc., via a JDBC or ODBC connection. The
following links should help you set that up:

https://drill.apache.org/docs/odbc-jdbc-interfaces/
https://drill.apache.org/docs/using-the-jdbc-driver/
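
If a full SQL client is more than you need for a demo, Drill can also be
queried from a short script over its REST API. A minimal sketch (assuming a
drillbit on localhost with the default web port 8047 and a file reachable
through the dfs storage plugin; the path is illustrative):

    import json
    import urllib.request

    # POST a SQL query to the drillbit's REST endpoint.
    url = "http://localhost:8047/query.json"
    payload = json.dumps({
        "queryType": "SQL",
        "query": "SELECT * FROM dfs.`path_to/dump.json` LIMIT 1",
    }).encode("utf-8")

    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)

    # The result set comes back as a list of row objects keyed by column.
    for row in result.get("rows", []):
        print(row)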

For the Hadoop data problem, can you share your storage plugin settings?
Normally you shouldn't have any problems seeing or querying HDFS data.






-- 

*M. Engin Sözer*
Junior Datawarehouse Manager
mesoezer@goodgamestudios.com

Goodgame Studios
Theodorstr. 42-90, House 9
22761 Hamburg, Germany
Phone: +49 (0)40 219 880 -0
*www.goodgamestudios.com <http://www.goodgamestudios.com>*

Goodgame Studios is a brand of Altigi GmbH
Altigi GmbH, District court Hamburg, HRB 99869
Board of directors: Dr. Kai Wawrzinek, Dr. Christian Wawrzinek, Fabian
Ritter

Re: [Design Document] Support the Ability to Identify And Skip Records when Function Evaluations Fail

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Oct 22, 2015 at 4:06 AM, ganesh <g4...@gmail.com> wrote:

> To make it look better, I started with Qlik Sense, but I was unable to
> connect it to the Hadoop file system; it only showed me Hive files. Then I
> downloaded a Tableau trial version, but I am unable to get at the Hadoop
> data there either...
>

This might help:

https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/

Re: [Design Document] Support the Ability to Identify And Skip Records when Function Evaluations Fail

Posted by ganesh <g4...@gmail.com>.
Hello,

John, you seem to be quite impressed with Apache Drill .. nice.
I am new to the unstructured-data world and started on Apache Drill one week
ago at the suggestion of my colleagues. We have semi-structured data with the
constraint that we do not know the number of columns.

I heard that Apache Drill is schema-free and, with its JSON support, creates
columns on the fly, so I converted my data from a CSV-like format to JSON and
am trying to figure out whether it will work for me.

Here I hit two issues:
1) My columns had names like "3100.2.1.2" and values like "-2303" or
"01/01/2015 02:02:00".

The challenge was that a column name cannot start with a numeric character,
so I had to change the key to "t3100.2.1.2". After that, things were
mostly OK.
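
A minimal sketch of that kind of key rewrite (assuming newline-delimited
JSON, illustrative file names, and that only top-level keys start with
digits):

    import json

    # Prefix any top-level key that starts with a digit, e.g.
    # "3100.2.1.2" becomes "t3100.2.1.2", so the name no longer
    # starts with a numeric character.
    with open("data.json") as src, open("data_fixed.json", "w") as dst:
        for line in src:
            record = json.loads(line)
            fixed = {("t" + k if k[:1].isdigit() else k): v
                     for k, v in record.items()}
            dst.write(json.dumps(fixed) + "\n")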

Now I need some help from you guys. To proceed, I have to present my work to
management as an example, but querying on the Apache Drill console doesn't
seem to be an attractive way to present things.

I tried Drill Explorer too, but didn't find it that good.
One thing to note: I am playing with files on Hadoop in standalone mode on
Ubuntu.

To make it look better, I started with Qlik Sense, but I was unable to
connect it to the Hadoop file system; it only showed me Hive files. Then I
downloaded a Tableau trial version, but I am unable to get at the Hadoop
data there either...

Please help me figure out how to proceed; I have a presentation this coming
Monday. The queries are quite ready .. I just need to show the results in
visualization form, using open-source applications only.

Guys, please help me.



On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <jo...@omernik.com> wrote:

> AWESOME!
>
> I had just been in the process of writing up a long user story to ask for
> and support exactly this.   I modified it and included it here:
>
>
> To start out, I want to say how much I love the Drill project, and the
> potential it has. I've put this together based on my experiences and want
> to contribute a perspective as a user, not just put a bunch of critiques in
> an email.  I hope it's all taken in that spirit.  Additional note, I wrote
> this prior to seeing the Design Document shared by Hsuan Yi Chu yesterday.
> If you are reading it, and think to yourself “that wording is odd…” please
> consider it from the “I didn’t want to throw away the user story”
> perspective and the “I wrote it before the design doc” perspective.
>
>
>
> Additionally, I understand that some of what I am suggesting may not be
> easy from a development perspective.  I am just being upfront with my
> experience, so we can look to determine what can be done; I am not looking
> for a silver bullet here, just looking for improvement.  Some may be as
> simple as better documentation, other suggestions may be harder to
> implement.  Either way, I thought a verbose user story might be useful to
> the community as a whole.
>
>
>
> John
>
>
>
> *User Story*
>
>
>
> As I have been working with Drill for data exploration, I came across
> multiple "things" that were just hard.  In dealing with some data,
> especially JSON data, it can be ugly, and scaled ugly is even worse!
>
>
>
> For this story, I am working with a JSON dump from MongoDB, and you would
> think it would be well structured, and for the most part it is.  There are
> some application level mistakes that were made (I will go into that in a
> moment), but in general Drill handles this well.  So with this data set,
> there are a few main challenges I am seeing:
>
>
>
> 1.     When there is a field that has a float, and then a later record has
> the number 0 in it (which Drill takes as an INT). This is a known problem
> and one that Drill has a solution for.
>
> 2.     When a field is of one type (a map) and then a later record
> has a string in it.  No easy solution here.
>
> 3.     Select * where there is a JSON field with a "." in the name. I won’t
> go into details here, but I feel this factors into data exploration,
> because it limits the ability to “stay in Drill” to explore the data (
> https://issues.apache.org/jira/browse/DRILL-3922)
>
> 4.     Error reporting challenges
>
>
>
>
>
> With the problem summary laid out, I wanted to walk through my process of
> working with this data and point out where Drill could have been much more
> helpful to me as a user.
>
>
>
> Here is a description of the process I went through:
>
>
>
> 1.     Copy data into filesystem
>
> 2.     Use Drill to run “select * from `path_to/dump.json` limit 1”
>
> 3.     (I just want to see what it looks like!)
>
>
>
>
>
> Here I get this error:
>
>
>
> > select * from `path_to/dump.json` limit 1;
>
> Error: DATA_READ ERROR: You tried to write a BigInt type when you are using
> a ValueWriter of type NullableFloat8WriterImpl.
>
>
>
> File  /data/dev/path_to/dump.json
>
> Record  1
>
> Line  1
>
> Column  9054
>
> Field  entropy
>
> Fragment 0:0
>
>
>
> This isn’t incredibly helpful from a user perspective.  When I Googled
> around, I realized that the docs talk about “Schema Changes” and that one
> possible fix is the setting below. However, showing examples of the data
> that failed to read (with its implied type) might help users grok what is
> happening.  At least in this case it showed me the field name!
>
>
>
> ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
>
>
>
> This is a great example of a known use case (numbers are doubles, but
> someone stores a 0, which reads as an INT) where the query fails, and the
> devs have added a setting to let a user get past it, yet the error message
> could still be more helpful.  In this case, showing two record numbers
> (line numbers) with the differing types, the field values with their
> implied types, and perhaps a suggestion to use the setting would address
> the problem.  This could make it more intuitive for the user to stay in
> Drill, and stay in the data.  In this case, I looked at the head of the
> file, saw the issue, and was able to proceed.
>
>
>
> Also, as a corollary, the user documentation does not mention this error
> in relation to the schema change problem. This would be a great place to state,
> “if you see an error that looks like X, this is what is happening and what
> you can do for it.”
>
>
>
>
>
> *Side note on documentation*
>
> We should look to make the documentation role-based.  In this case, the
> documentation says to use “ALTER SYSTEM”.  I would argue, and I am guessing
> others would concur, that for this use case “ALTER SESSION” may be a better
> suggestion, since this is a specific alteration to address the use case of
> loading/querying a specific data set, and is likely done by an ordinary
> user of the system.
>
>
>
> If a user is doing self-serve data work in an enterprise environment, they
> may not have the privileges to use ALTER SYSTEM; they will get an error and
> may be confused about how to proceed.  In addition, an ALTER SYSTEM by a
> user who doesn’t understand what they are changing, yet has the rights to
> change it, may introduce future data problems they didn’t expect.  I like
> that the default is the more restrictive method, because it makes people be
> explicit about data, yet the documentation should also aim to be explicit
> about something like a system-wide change.
>
>
>
>
>
> *Back to the story*
>
> Ok, so now I set the read_numbers_as_double option at session scope (the
> session-level form of the ALTER SYSTEM command shown earlier):
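>
> ALTER SESSION SET `store.json.read_numbers_as_double` = true;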
>
>
>
> I run the query again.
>
>
>
> > select * from `path_to/dump.json` limit 1;
>
> Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar
> type when you are using a ValueWriter of type SingleMapWriter.
>
>
>
> File  /data/dev/path_to/dump.json
>
> Record  4009
>
> Fragment 0:0
>
>
>
> Another error.  But what does this one mean?  Now that I have been living
> in the docs and on the Drill user list, and because it’s similar to the
> schema change issue, I can tell that is what we are looking at here.
> Instead of double vs. int, we have one field that is a map most of the
> time, and in some cases it’s a string.
>
>
>
> But this doesn’t really help me as a user.  Drill doesn’t offer any
> options for troubleshooting this.  The file is 500 MB of dense and nested
> JSON data with 51k records.  My solution?  I took the record number, then
> went to my NFS-mounted clustered file system (thank goodness I had MapR
> here; I am not sure how I would have done this with POSIX tools).
>
>
>
> My command: $ head -4009 dump.json|tail -1
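>
> (A Python equivalent, in case the file is not reachable with shell tools;
> this assumes the dump is newline-delimited, one record per line:)
>
>     with open("dump.json") as f:
>         for lineno, line in enumerate(f, start=1):
>             # Drill's record number maps to a line number here.
>             if lineno == 4009:
>                 print(line, end="")
>                 break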
>
>
>
> That (I hoped) showed me the record in question.  Note that the error from
> Drill didn’t tell me which field was at fault, so I had to visually align
> things to find it.  However, I was able to spot the difference and work
> with the developer to understand why it happened. I removed those records,
> and things worked correctly.
>
>
>
> Could there have been a way to identify that within Drill?  My solution
> was to write a Python script that read through the file and discarded the
> records that were not a map.  That works on 500 MB, but what about 500 GB?
> I guess a Spark job could clean the data….  But could Drill be given some
> tools to help with this situation?
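>
> (A minimal sketch of that kind of filter script, assuming newline-delimited
> JSON and a hypothetical field name "payload" for the field that is usually
> a map but sometimes a string:)
>
>     import json
>
>     # Keep only records whose "payload" field is actually a map; write
>     # the clean records to a new file that Drill can read.
>     with open("dump.json") as src, open("dump_clean.json", "w") as dst:
>         for line in src:
>             record = json.loads(line)
>             if isinstance(record.get("payload"), dict):
>                 dst.write(line)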
>
>
>
> For example, the first thing I asked was: what field is at issue?  I had
> no way to see what was going on there.  I had to use other tools to see the
> data so I could understand the problem, and once I understood the problem,
> I had to use Python to produce data that was queryable.
>
>
>
> Based on the design document Hsuan Yi Chu just posted to the mailing list,
> at this point my post is just a user story to support the design document.
> To summarize the points I’d like to see included in the design document
> (from a user perspective, without presuming to know the “how or why”):
>
> *1.     **Error messages that are more verbose in explaining the problem*
>
> a.     Filename, row number, column number or name
>
> b.     Option to output the “offending row”
>
> c.     Showing the data that is causing the error WITH the type Drill
> inferred.
>
> d.     If there are options to help work through dirty data, perhaps the
> error message could include those: “Data was a double, then Drill found
> this data: 0, which was an int, in file X at row 24 in column
> “myfloatingdata”; consider using store.json.read_numbers_as_double to
> address the issue.”
>
> 2.     *A way to determine how common this exception is*
>
> a.     If I am playing with a messy data set and this error happens, does
> it happen on 1 record? 2? 5,000?  Knowing that information would:
>
>        i.     Help users understand how Drill is seeing that particular
> column
>
>        ii.    Help with decisions about excluding data rather than just
> removing it.  What if the first 10 records were errors, and you excluded
> the remaining 10 million because they were correct yet different from the
> first 10?
>
> b.     Perhaps there could be a “stats” function that only works if it is
> the only selected item, or if the select contains only such stats
> functions?
>
>        i.     Select type_stats(fieldname) from data
>
>        ii.    (that wouldn’t error on different types)
>
> 3.     *An ability to say “return null for this field if there is an
> error, or if it is not castable to type X”, especially in a view, perhaps
> via a function*
>
> a.     Allow users to avoid reparsing data outside Drill
>
> b.     Load it into a sane format (one-time loads/ETL to clean data)
>
> c.     Not be a system- or session-wide exception.
>
>        i.     I think this is important because I may have a field where I
> want to read the numbers as double, but what if I have another field in the
> same dataset where I don’t want to read the numbers as double? A SYSTEM- or
> SESSION-level variable takes away that granularity.
>
> d.     Select field1, CASTORNULL(field2, int) as field2,
> CASTORNULL(field3, double) as field3 from ugly_data
>
> e.     That’s an example in the select, but I could see a where clause
> too:
>
> f.     Select field1, field2, field3 from ugly_data where ISTYPE(field2,
> int) and ISTYPE(field3, double)
>
> 4.     *Updating the documentation on ALTER SESSION vs. ALTER SYSTEM with
> an eye to the majority use case of the documented feature*
>
> a.     For data loads, the documentation uses ALTER SYSTEM, and that’s
> problematic because:
>
>        i.     Not all users have the privileges to issue an ALTER SYSTEM.
> A new user trying to figure things out may not realize they can just use
> ALTER SESSION after getting an ALTER SYSTEM error.
>
>        ii.    ALTER SYSTEM on data-loading items, especially in areas that
> make Drill’s data interpretation more permissive, can lead to unintended
> consequences later.  An admin who is helping a data user troubleshoot an
> error may issue an ALTER SYSTEM without realizing it changes all future
> data imports.
>
> b.     I found a few cases, but I would suggest a thorough review of the
> various use cases throughout the documentation; in areas where it really
> could be either command, add a small paragraph indicating the
> ramifications of each.
>
> 5.     *A philosophy within the Drill community to “stay in Drill” for
> data exploration*
>
> a.     This is obviously not so much a development thing as a mindset.  If
> someone says “I tried to do X, and I got an error” and the community’s
> response is “look through your data and do Z to it so Drill can read it”,
> then we should reconsider that scenario and try to provide an option
> within Drill to intuitively handle the edge case.  This is difficult.
>
> b.     There are cases even in the documentation where this happens:
> https://drill.apache.org/docs/json-data-model/ talks about arrays at the
> root level or reading some empty arrays.  In these cases, we have to leave
> Drill to fix the problem.  That works on small data, but may not work on
> large or wide data.  Consider the array-at-root-level limitation: what if
> some process out of the user’s control produces 1,000 100 MB JSON files
> that we want to read?  To fix it, we have to rewrite those files; lots of
> work, either manual or automated.
>
> c.     Once again, I know this isn’t easy, but we shouldn’t answer
> questions about how to do something by saying “fix this outside of Drill
> so Drill can read your data” if at all possible.
>
>
>
>
>
>
>
> I hope this story helps support the design document presented.  I am
> happy to participate in more discussion around these topics, as I have
> enjoyed digging into the internals of Drill.
>
>
>
> John Omernik
>



-- 
*Name: Ganesh Semalty*
*Location: Gurgaon,Haryana(India)*
*Email Id: g4ganeshsemalty@gmail.com <g4...@gmail.com>*



*Please consider the environment before printing this e-mail - SAVE TREE.*

Re: [Design Document] Support the Ability to Identify And Skip Records when Function Evaluations Fail

Posted by John Omernik <jo...@omernik.com>.
AWESOME!

I had just been in the process of writing up a long user story to ask for
and support exactly this.   I modified it and included it here:


To start out, I want to say how much I love the Drill project, and the
potential it has. I've put this together based on my experiences and want
to contribute a perspective as a user, not just put a bunch of critiques in
an email.  I hope it's all taken in that spirit.  Additional note, I wrote
this prior to seeing the Design Document share by Hsuan Yi Chu yesterday.
If you are reading it, and think to yourself “that wording is odd…” please
consider it from the “I didn’t want to throw away the user story”
perspective and the “I wrote it before the design doc” perspective.



Additionally, I understand that some of what I am suggesting may not be
easy from a development perspective.  I am just being upfront with my
experience, so we can look to determine what can be done; I am not looking
for a silver bullet here, just looking for improvement.  Some may be as
simple as better documentation, other suggestions may be harder to
implement.  Either way, I thought a verbose user story might be useful to
the community as a whole.



John



*User Story*



As I have been working with Drill for data exploration, I came across
multiple "things" that just were hard.  In dealing with some data,
especially JSON data, it can be ugly, and scaled ugly is even worse!



For this story, I am working with a JSON dump from MongoDB, and you would
think it would be well structured, and for the most part it is.  There are
some application level mistakes that were made (I will go into that in a
moment), but in general Drill handles this well.  So with this data set,
there are a few main challenges I am seeing:



1.     When there is a field that has a float, and then a later record has
the number 0 in it (which Drill takes as a INT). This is a known problem
and one that Drill has a solution for.

2.     When there is a field is of one type (a map) and then a later record
has a string in it.  No easy solution here.

3.     Select * where there is a json field with a . in the name. I won’t
go into details here, but I feel this factors into data exploration,
because it changes the ability to “stay in Drill” to explore their data (
https://issues.apache.org/jira/browse/DRILL-3922)

4.     Error reporting challenges





With the problem summary laid out, I wanted to walk through my process in
working with this data, and where, if I were a user Drill could have been
much more helpful to the process.



Here is a description of the process I went through:



1.     Copy data into filesystem

2.     Use drill to “Select * from `path_to/dump.json` limit 1

3.     (I just want to see what it looks like!)





Here I get this error:



> select * from `path_to/ dump.json` limit 1;

Error: DATA_READ ERROR: You tried to write a BigInt type when you are using
a ValueWriter of type NullableFloat8WriterImpl.



File  /data/dev/path_to/dump.json

Record  1

Line  1

Column  9054

Field  entropy

Fragment 0:0



This isn’t incredibly helpful from a user perspective.  I.e. When I Google
around, I realize now that in the docs it talks about “Schema Changes” and
one possible item is use the setting below. However, examples of the data
that was trying to be displayed (with it’s implied type) may help users
grok what is happening.  At least in this case it showed me the field name!



ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;



This is a great example where since we have known use case (when numbers
are doubles but someone tries to store 0 an INT) it fails, thus dev’s have
added a setting to allow a user to get through that, that the error message
could be more helpful.   In this case, Showing two record numbers (line
numbers) with different types, the field values with their implied types,
and perhaps a suggestion about using the setting to address the problem.
This could make it more intuitive for the user to stay in Drill, and stay
in the data.   In this case, I looked at a head of the file, and saw the
issue and was able to proceed.



Also, as a corollary here, the user documentation does not show this error
related to the schema change problem. This would be a great place to state,
“if you see an error that looks like X, this is what is happening and what
you can do for it.”





*Side node on documentation*

We should look to have the documentation be role based.   In this case,
the documentation says to use “ALTER SYSTEM”. I would argue, and I am
guessing others would concur, that for this use case “ALTER SESSION” is
the better suggestion, since this is a targeted alteration for
loading/querying a specific data set and is likely done by a user of the
system.



If a user is doing self-serve data exploration in an enterprise
environment, they may not have the privileges to use ALTER SYSTEM; they
will get an error and may be confused about how to proceed.   In addition,
an ALTER SYSTEM issued by a user who doesn’t understand what they are
changing, yet has the rights to change it, may introduce future data
problems they didn’t expect.   I like that the default is the more
restrictive behavior, because it makes people be explicit about their
data, but the documentation should be equally explicit about something
like a system-wide change.





*Back to the story*

OK, so now I will do an ALTER SESSION SET on the read_numbers_as_double setting.
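
Concretely, that is just the session-scoped counterpart of the ALTER
SYSTEM statement from the docs:

ALTER SESSION SET `store.json.read_numbers_as_double` = true;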



I run the query again.



> select * from `path_to/dump.json` limit 1;

Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar
type when you are using a ValueWriter of type SingleMapWriter.



File  /data/dev/path_to/dump.json

Record  4009

Fragment 0:0



Another error.  But what does this one mean? Having been living in the
docs and on the Drill user list, and because it’s similar to the schema
change issue, I can guess that is what we are looking at here.  Instead
of double vs. int, we have one field that is a map most of the time, and
in some cases it’s a string.



But this doesn’t really help me as a user, and to troubleshoot it Drill
doesn’t offer any options. This file is 500 MB of dense and nested JSON
data with 51k records.   My solution? I took the record number and went
to my NFS-mounted clustered file system (thank goodness I had MapR here;
I am not sure how I would have done this without POSIX tools).



My command: $ head -4009 dump.json | tail -1



That (I hoped) showed me the record in question. Note that the error from
Drill didn’t tell me which field was at fault, so I had to visually align
things to figure that out.  However, I was able to spot the difference and
work with the dev to understand why it happened. I removed those records,
and things worked correctly.



Could there have been a way to identify that within Drill? My solution was
a Python script that read through the dump and discarded the records that
were not a map. On 500 MB that can work, but what about 500 GB?  I
guess a Spark job could clean the data… but could Drill be given some
tools to help with this situation?



For example, the first thing I asked was: what field is at issue?  I had
no way to see that from within Drill.  I had to use other tools to look at
the data so I could understand the problem, and once I understood the
problem, I had to use Python to produce data that was queryable.



Given the design document Hsuan Yi Chu just posted to the mailing list,
at this point my post is simply a user story in support of it.
To summarize the points I’d like to see included in the design document
(from a user perspective, without claiming to understand the “how or why”):


*1.     **Error messages that are more verbose in explaining the problem*

a.     Filename, row number, column number or name

b.     Option to output the “offending row”

c.     Showing the data that is causing the error WITH the type Drill
inferred.

d.     If there are options to help work through dirty data, perhaps the
error message could include those: “Data was a double, then Drill found
this data: 0, which was an int, in file X at row 24 in column
‘myfloatingdata’; consider using store.json.read_numbers_as_double to
address the issue.”

2.     *A way to determine how common this exception is*

a.     If I am playing with a messy data set, and this error happens, does
it happen on 1 record? 2? 5000?  Knowing that information would:

                                               i.     Help users understand
how Drill is seeing that particular column

                                             ii.     Help them make
informed decisions about excluding data rather than just removing it.
What if the first 10 records were the errors, and you excluded the
remaining 10 million because they were correct yet different from the
first 10?

b.     Perhaps there could be a “stats” function that only works when it
is the only selected item, or when everything in the select list is such
a stats function (a sketch follows this item)?

                                               i.     Select
type_stats(fieldsname) from data

                                             ii.      (one that wouldn’t
error on different types)
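
To make (b) concrete, here is a sketch of how such a purely hypothetical
type_stats function might be invoked, reusing the entropy field and file
from the story above:

-- hypothetical: instead of erroring on the first mismatch, report how
-- many records carry each inferred type for this column
SELECT type_stats(entropy) FROM `path_to/dump.json`;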

3.     *An ability to say “return null for this field if there is an
error or if it is not castable to type X”, especially in a view, perhaps
via a function.*

a.     Allow users to avoid having to reparse data outside Drill

b.     Load it into a sane format (one-time loads/ETL to clean data)

c.     Not be a system- or session-wide exception.

                                               i.     I think this is
important because I may have a field where I want the numbers read as
double, but what if I have another field in the same dataset where I
don’t want the numbers read as double? A SYSTEM or SESSION level variable
takes away that granularity.

d.     Select field1, CASTORNULL(field2, int) as field2, CASTORNULL(field3,
double) as field3 from ugly_data

e.     That’s an example of it in the select list, but I could also see a
where clause; both variants are written out as full queries below:

f.      Select field1, field2, field3 from ugly_data where ISTYPE(field2,
int) and ISTYPE(field3, double)
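
Sketched as full queries (with CASTORNULL and ISTYPE as hypothetical
functions, and the field and table names as placeholders):

-- hypothetical per-column cast-or-null, as in point (d)
SELECT field1,
       CASTORNULL(field2, INT)    AS field2,
       CASTORNULL(field3, DOUBLE) AS field3
FROM ugly_data;

-- hypothetical type predicate in the where clause, as in point (f)
SELECT field1, field2, field3
FROM ugly_data
WHERE ISTYPE(field2, INT) AND ISTYPE(field3, DOUBLE);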

4.     *Updating the documentation on ALTER SESSION vs. ALTER SYSTEM with
an eye to the majority use case of each documented feature*

a.     For data loads, the documentation uses ALTER SYSTEM and that’s
problematic because:

                                               i.     Not all users have
the privileges to issue an ALTER SYSTEM. Thus a new user trying to figure
things out may not realize they can just ALTER SESSION after getting an
ALTER SYSTEM error.

                                             ii.     ALTER SYSTEM on
data-loading settings, especially ones that make Drill’s data
interpretation more permissive, can lead to unintended consequences
later. An admin, who may be a good systems admin, helping a data user
troubleshoot an error may issue an ALTER SYSTEM without realizing it
changes all future data imports.

b.     Note, I found a few cases, but I would suggest a thorough review of
the various use cases throughout the documentation; in areas where it
really could be either command, add a small paragraph indicating the
ramifications of each.

*5.     **A Philosophy within the Drill Community to “Stay in Drill” for
data exploration*

a.     This is obviously not so much a development thing as a mindset.
If someone says “I tried to do X, and I got an error” and the community’s
response is “look through your data and do Z to it so Drill can read it”,
then we should reconsider that scenario and try to provide an option
within Drill to handle the edge case intuitively.  This is difficult.

b.     There are cases even in the documentation where this happens:
https://drill.apache.org/docs/json-data-model/ talks about arrays at the
root level and about reading some empty arrays.  In those cases, we have
to leave Drill to fix the problem. That works on small data, but may not
work on large or wide data. Consider the array-at-root-level limitation
(a tiny example follows below): what if some process out of the user’s
control produces 1,000 100 MB JSON files and we want to read them? To fix
it, we have to rewrite those files. Lots of work there, either manual or
automated.
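
For instance, a tiny (hypothetical) file shaped like the following hits
the root-level-array limitation, even though each element on its own
would be perfectly readable:

[ {"a": 1}, {"a": 2} ]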

c.     Once again, I know this isn’t easy, but wherever possible we
shouldn’t answer questions about how to do something with “fix this
outside of Drill so Drill can read your data”.







I hope this story helps support the design document presented.  I am happy
to participate in more discussion around these topics, as I have enjoyed
digging into the internals of Drill.



John Omernik