Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2015/10/22 15:13:45 UTC

New Topic: Drill Visualization and Use Support

I separated my response from the original topic to keep any responses there
focused on the design document.

As to ways to use Drill, I have been working with SQuirreL SQL quite
successfully.  Also on my list of things to do, if you want to stay in the
Apache world, is looking at Apache Zeppelin.  In its Git repo there is a
Drill plugin, so you can run SQL against Drill, look at results, and do basic
visualizations.  I have been waiting until the PR is merged into Zeppelin,
but for your use case you may want to grab the plugin code and run with it.




On Thu, Oct 22, 2015 at 6:06 AM, ganesh <g4...@gmail.com> wrote:

> Hello,
>
> John, you seem to be quite impressed with Apache Drill .. nice.
> I am new to the unstructured world and started just one week back on Apache
> Drill after a suggestion from my colleagues. We have semi-structured data
> with the constraint that we do not know the number of columns.
>
> I heard that Apache Drill is schema-free and, with its support for the
> JSON format, allows columns to be created on the fly.
> I converted my data from a CSV-like format to JSON and am trying to figure
> out if it will work for me.
>
> Here I hit two issues:
> 1) My columns were named like 3100.2.1.2 and my values looked like "-2303"
> or "01/01/2015 02:02:00"
>
> The challenge was that a column name can't start with a numeric value, so I
> had to change the key to "t3100.2.1.2".
> After that things were quite OK.
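The key rewrite described above can be sketched in Python: a minimal recursive pass, assuming newline-delimited JSON records and the `t` prefix convention (the helper name `fix_keys` is my own, not part of any tool):

```python
import json
import re

def fix_keys(obj):
    """Recursively prefix keys that start with a digit with 't',
    so names like "3100.2.1.2" become queryable identifiers."""
    if isinstance(obj, dict):
        return {("t" + k if re.match(r"\d", k) else k): fix_keys(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [fix_keys(v) for v in obj]
    return obj

record = json.loads('{"3100.2.1.2": "-2303"}')
print(json.dumps(fix_keys(record)))  # {"t3100.2.1.2": "-2303"}
```

Running a pass like this once, while converting from the CSV-like format to JSON, avoids fixing keys by hand.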
>
> Now I need some help from you guys. To proceed, I have to present my work to
> management as an example.
> But querying on the Apache Drill console doesn't seem to be an attractive
> way to present things.
>
> I tried Drill Explorer too, but didn't find it that good.
> One thing to note: I am playing with files on Hadoop in standalone mode on
> Ubuntu.
>
> To make it look better, I started with Qlik Sense, but I was unable to
> connect it to the Hadoop file system. It only showed me Hive files.
> Then I downloaded a Tableau trial version, but I am unable to get Hadoop
> data there too...
>
> Please help me figure out how to proceed. I have a presentation this coming
> Monday. The queries are quite ready; I just need to show them in
> visualization form, using OPEN SOURCE applications only.
>
>
> Guys please help me.
>
>
>
> On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <jo...@omernik.com> wrote:
>
> > AWESOME!
> >
> > I had just been in the process of writing up a long user story to ask for
> > and support exactly this.   I modified it and included it here:
> >
> >
> > To start out, I want to say how much I love the Drill project and the
> > potential it has. I've put this together based on my experiences and want
> > to contribute a perspective as a user, not just put a bunch of critiques
> > in an email.  I hope it's all taken in that spirit.  As an additional
> > note, I wrote this prior to seeing the Design Document shared by Hsuan Yi
> > Chu yesterday.  If you are reading it and think to yourself “that wording
> > is odd…”, please consider it from the “I didn’t want to throw away the
> > user story” perspective and the “I wrote it before the design doc”
> > perspective.
> >
> >
> >
> > Additionally, I understand that some of what I am suggesting may not be
> > easy from a development perspective.  I am just being upfront with my
> > experience, so we can look to determine what can be done; I am not
> > looking for a silver bullet here, just looking for improvement.  Some
> > fixes may be as simple as better documentation; other suggestions may be
> > harder to implement.  Either way, I thought a verbose user story might be
> > useful to the community as a whole.
> >
> >
> >
> > John
> >
> >
> >
> > *User Story*
> >
> >
> >
> > As I have been working with Drill for data exploration, I came across
> > multiple "things" that just were hard.  In dealing with some data,
> > especially JSON data, it can be ugly, and scaled ugly is even worse!
> >
> >
> >
> > For this story, I am working with a JSON dump from MongoDB, and you would
> > think it would be well structured, and for the most part it is.  There
> are
> > some application level mistakes that were made (I will go into that in a
> > moment), but in general Drill handles this well.  So with this data set,
> > there are a few main challenges I am seeing:
> >
> >
> >
> > 1.     When there is a field that has a float, and then a later record
> > has the number 0 in it (which Drill takes as an INT). This is a known
> > problem and one that Drill has a solution for.
> >
> > 2.     When a field is of one type (a map) and then a later record has a
> > string in it.  No easy solution here.
> >
> > 3.     Select * where there is a JSON field with a . in the name. I won’t
> > go into details here, but I feel this factors into data exploration,
> > because it changes the ability to “stay in Drill” to explore the data (
> > https://issues.apache.org/jira/browse/DRILL-3922)
> >
> > 4.     Error reporting challenges
> >
> >
> >
> >
> >
> > With the problem summary laid out, I wanted to walk through my process in
> > working with this data, and where, as a user, Drill could have been much
> > more helpful to me in the process.
> >
> >
> >
> > Here is a description of the process I went through:
> >
> >
> >
> > 1.     Copy data into filesystem
> >
> > 2.     Use Drill to “Select * from `path_to/dump.json` limit 1”
> >
> > 3.     (I just want to see what it looks like!)
> >
> >
> >
> >
> >
> > Here I get this error:
> >
> >
> >
> > > select * from `path_to/dump.json` limit 1;
> >
> > Error: DATA_READ ERROR: You tried to write a BigInt type when you are
> using
> > a ValueWriter of type NullableFloat8WriterImpl.
> >
> >
> >
> > File  /data/dev/path_to/dump.json
> >
> > Record  1
> >
> > Line  1
> >
> > Column  9054
> >
> > Field  entropy
> >
> > Fragment 0:0
> >
> >
> >
> > This isn’t incredibly helpful from a user perspective.  After Googling
> > around, I realize now that the docs talk about “Schema Changes” and that
> > one possible fix is the setting below. However, showing examples of the
> > data that Drill was trying to read (with its implied type) might help
> > users grok what is happening.  At least in this case it showed me the
> > field name!
> >
> >
> >
> > ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
> >
> >
> >
> > This is a great example where we have a known use case (numbers that are
> > doubles until someone tries to store a 0, which is read as an INT), the
> > query fails, and the devs have added a setting to let a user get through
> > it, yet the error message could still be more helpful.  In this case,
> > showing two record numbers (line numbers) with different types, the field
> > values with their implied types, and perhaps a suggestion to use the
> > setting to address the problem could make it more intuitive for the user
> > to stay in Drill, and stay in the data.  As it was, I looked at the head
> > of the file, saw the issue, and was able to proceed.
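The kind of error message wished for here, showing the record numbers where conflicting types were first seen, can be prototyped outside Drill. A rough sketch over newline-delimited JSON, limited to top-level numeric fields (the function and variable names are illustrative, not Drill APIs):

```python
import json

def numeric_type_conflicts(path):
    """Scan newline-delimited JSON and report, per top-level field, the
    first record number seen as an int and the first seen as a float --
    the two line numbers a friendlier error message could show."""
    first_seen = {}   # field -> {type name: first record number}
    conflicts = {}
    with open(path) as f:
        for recno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            for field, value in json.loads(line).items():
                # bool is a subclass of int in Python; skip it
                if isinstance(value, bool) or not isinstance(value, (int, float)):
                    continue
                seen = first_seen.setdefault(field, {})
                seen.setdefault(type(value).__name__, recno)
                if len(seen) > 1:
                    conflicts[field] = seen
    return conflicts
```

The result maps each conflicting field to the first record number at which each numeric type was seen, which is exactly the context a reader needs to decide whether read_numbers_as_double is the right fix.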
> >
> >
> >
> > Also, as a corollary here, the user documentation does not show this
> error
> > related to the schema change problem. This would be a great place to
> state,
> > “if you see an error that looks like X, this is what is happening and
> what
> > you can do for it.”
> >
> >
> >
> >
> >
> > *Side note on documentation*
> >
> > We should look to make the documentation role based.  In this case, the
> > documentation says to use “ALTER SYSTEM”. I would argue, and I am
> > guessing others would concur, that for this use case “ALTER SESSION” may
> > be a better suggestion, as this is a specific alteration to address the
> > use case of loading/querying a specific data set, and is likely done by a
> > user of the system.
> >
> >
> >
> > If a user is doing self-serve data in an enterprise environment, they may
> > not have the ability to use ALTER SYSTEM and will get an error, and thus
> > may be confused about how to proceed.  In addition, an ALTER SYSTEM by a
> > user who doesn’t understand what they are changing, yet has the rights to
> > change it, may introduce future data problems they didn’t expect.  I like
> > that the default is the more restrictive method, because it makes people
> > be explicit about data, yet the documentation should also aim to be
> > explicit about something like a system-wide change.
> >
> >
> >
> >
> >
> > *Back to the story*
> >
> > OK, so now I apply the read_numbers_as_double setting with ALTER SESSION
> > instead:
> >
> > ALTER SESSION SET `store.json.read_numbers_as_double` = true;
> >
> >
> >
> > I run the query again.
> >
> >
> >
> > > select * from `path_to/dump.json` limit 1;
> >
> > Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar
> > type when you are using a ValueWriter of type SingleMapWriter.
> >
> >
> >
> > File  /data/dev/path_to/dump.json
> >
> > Record  4009
> >
> > Fragment 0:0
> >
> >
> >
> > Another error.  But what does this one mean? OK, now that I have been
> > living in the docs and on the Drill user list, and because it’s similar
> > to the schema change issue, I know that is what we are looking at here.
> > Instead of double vs. int, we have one field that is a map most of the
> > time, and in some cases is a string.
> >
> >
> >
> > But this doesn’t really help me as a user.  Drill doesn’t offer any
> > options to troubleshoot this. The file is 500 MB of dense and nested JSON
> > data with 51k records.  My solution? I took the record number, then went
> > to my NFS-mounted clustered file system (thank goodness I had MapR here;
> > I am not sure how I would have done this otherwise with POSIX tools).
> >
> >
> >
> > My command: $ head -4009 dump.json|tail -1
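An equivalent single-pass extraction, and one that stops reading once it finds the target line, could be done with awk; 4009 here is the record number from the error:

```shell
# print only record 4009 of the dump, then stop reading the file
awk 'NR==4009 {print; exit}' dump.json
```

On a multi-gigabyte dump the early exit matters, since head piped to tail still reads the first 4009 lines but awk can stop mid-file on any line number.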
> >
> >
> >
> > That (I hoped) showed me the record in question. Note that the error from
> > Drill didn’t tell me which field was at fault here, so I had to visually
> > align things to figure that out.  However, I was able to spot the
> > difference and work with the dev to understand why it happened. I removed
> > those records, and things worked correctly.
> >
> >
> >
> > Could there have been a way to identify that within Drill? My solution
> > was to write a Python script to read through the file and discard the
> > records that were not a map. On 500 MB that can work, but what about 500
> > GB?  I guess a Spark job could clean the data… but could Drill be given
> > some tools to help with this situation?
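The Python cleanup pass described above might look roughly like this; it streams record by record, so it should behave the same on 500 GB as on 500 MB, just slower (the field name and helper name are hypothetical):

```python
import json

def drop_non_map_records(in_path, out_path, field):
    """Copy newline-delimited JSON, keeping only records where `field`
    is absent or is a map (dict); return how many were dropped."""
    dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue
            rec = json.loads(line)
            if field in rec and not isinstance(rec[field], dict):
                dropped += 1
                continue
            dst.write(line)
    return dropped
```

Returning the dropped count also answers the "does it happen on 1 record? 5000?" question raised later, before any data is thrown away.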
> >
> >
> >
> > For example, the first thing I asked was: what field is at issue?  I had
> > no way to see what was up there.  I had to use other tools to see the
> > data so I could understand the problem, and once I understood the
> > problem, I had to use Python to produce data that was queryable.
> >
> >
> >
> > Based on the design document Hsuan Yi Chu just posted to the mailing
> > list, at this point my post is just a user story to support the design
> > document.  To summarize the points I’d like to see included in the design
> > document (from a user perspective, without speaking to the “how or why”):
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > *1.     **Error messages that are more verbose in explaining the problem*
> >
> > a.     Filename, row number, column number or name
> >
> > b.     Option to output the “offending row”
> >
> > c.     Showing the data that is causing the error WITH the type Drill
> > inferred.
> >
> > d.     If there are options to help work through dirty data, perhaps the
> > error message could include them: “Data was a double, then Drill found
> > this data: 0, which was an int, in file X at row 24 in column
> > “myfloatingdata”; consider using store.json.read_numbers_as_double to
> > address the issue.”
> >
> > 2.     *A way to determine how common this exception is*
> >
> > a.     If I am playing with a messy data set and this error happens, does
> > it happen on 1 record? 2? 5000?  Knowing that information would:
> >
> >        i.     Help users understand how Drill is seeing that particular
> > column
> >
> >        ii.     Make decisions about excluding data rather than just
> > removing it. What if the first 10 records were errors, and then you
> > excluded the remaining 10 million because they were correct yet different
> > from the first 10?
> >
> > b.     Perhaps there could be a “stats” function that only works if it’s
> > the only selected item, or if the select contains only such (stats)
> > functions?
> >
> >        i.     Select type_stats(fieldname) from data
> >
> >        ii.     (that wouldn’t error on different types)
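No such type_stats function exists in Drill today, but its intended behavior could be approximated outside Drill over newline-delimited JSON with something like this (the type names below are Python's, standing in for Drill's):

```python
import json
from collections import Counter

def type_stats(path, field):
    """Count how many records hold each JSON type in `field`,
    answering 'does this error happen on 1 record? 2? 5000?'."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            value = json.loads(line).get(field)
            counts[type(value).__name__] += 1
    return dict(counts)
```

A histogram like this is exactly what would let a user choose between excluding the odd records and rethinking the schema.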
> >
> > 3.     *An ability to say “return null on this field if there is an error
> > or it is not castable to type X”, especially in a view, perhaps via a
> > function.*
> >
> > a.     Allow users not to have to reparse data outside Drill
> >
> > b.     Load it into a sane format (one-time loads/ETL to clean data)
> >
> > c.     Not be a system- or session-wide exception.
> >
> >        i.     I think this is important because I may have a field where
> > I want Drill to read the numbers as double, but what if I have another
> > field in the same dataset where I don’t want it to read the numbers as
> > double? A SYSTEM or SESSION level variable takes away that granularity.
> >
> > d.     Select field1, CASTORNULL(field2, int) as field2,
> > CASTORNULL(field3, double) as field3 from ugly_data
> >
> > e.     That’s an example where it’s in the select, but I could see a
> > where clause too:
> >
> > f.      Select field1, field2, field3 from ugly_data where
> > ISTYPE(field2, int) and ISTYPE(field3, double)
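CASTORNULL and ISTYPE are names I made up, not existing Drill functions; their intended per-field semantics can be sketched in Python like this:

```python
def cast_or_null(value, target):
    """Return value cast to the target type, or None if the cast fails --
    applied per field, rather than as a session-wide setting."""
    try:
        return target(value)
    except (TypeError, ValueError):
        return None

def is_type(value, target):
    """True if value already is (or cleanly casts to) the target type."""
    return cast_or_null(value, target) is not None

row = {"field2": "12", "field3": "oops"}
cleaned = {"field2": cast_or_null(row["field2"], int),
           "field3": cast_or_null(row["field3"], float)}
# cleaned == {"field2": 12, "field3": None}
```

The point is the granularity: the null-on-failure behavior applies to one column in one query, not to the whole session or system.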
> >
> > 4.     *Updating of the documentation related to ALTER SESSION vs ALTER
> > SYSTEM, with an eye to the context of the majority use case of the
> > documented feature*
> >
> > a.     For data loads, the documentation uses ALTER SYSTEM and that’s
> > problematic because:
> >
> >        i.     Not all users have the privileges to issue an ALTER SYSTEM.
> > Thus a new user trying to figure things out may not realize they can just
> > ALTER SESSION after getting an ALTER SYSTEM error.
> >
> >        ii.     ALTER SYSTEM on data loading items, especially in areas
> > that make Drill’s data interpretation more permissive, can lead to
> > unintended consequences later. An admin, who may be a good systems admin,
> > and who helps a data user troubleshoot an error, may issue an ALTER
> > SYSTEM not realizing this changes all future data imports.
> >
> > b.     Note: I found a few cases, but I would suggest a thorough review
> > of the various use cases throughout the documentation, and in areas where
> > it really could be either, a small paragraph indicating the ramifications
> > of each command.
> >
> > *5.     **A philosophy within the Drill community to “stay in Drill” for
> > data exploration*
> >
> > a.     This is obviously not so much a development thing as a mindset.
> > If someone says “I tried to do X and I got an error”, and the community’s
> > response is “look through your data and do Z to it so Drill can read it”,
> > then we should reconsider that scenario and try to provide an option
> > within Drill to intuitively handle the edge case.  This is difficult.
> >
> > b.     There are cases even in the documentation where this comes up:
> > https://drill.apache.org/docs/json-data-model/ talks about arrays at the
> > root level and reading some empty arrays.  In these cases, we have to
> > leave Drill to fix the problem. That works on small data, but may not
> > work on large or wide data. Consider the array-at-root-level limitation.
> > What if some process out of the user’s control produces 1000 100 MB JSON
> > files and we want to read them? To fix it, we have to rewrite those
> > files. That is a lot of work, either manual or automated.
> >
> > c.     Once again, I know this isn’t easy, but if at all possible we
> > shouldn’t answer questions about how to do something by saying “fix this
> > outside of Drill so Drill can read your data”.
> >
> >
> >
> >
> >
> >
> >
> > I hope this story helps support the design document presented.  I am
> > happy to participate in more discussion around these topics, as I have
> > enjoyed digging into the internals of Drill.
> >
> >
> >
> > John Omernik
> >
>
>
>
> --
> *Name: Ganesh Semalty*
> *Location: Gurgaon,Haryana(India)*
> *Email Id: g4ganeshsemalty@gmail.com <g4...@gmail.com>*
>
>
>

Re: New Topic: Drill Visualization and Use Support

Posted by Tom Barber <to...@meteorite.bi>.
Hi chaps

If you are looking for open source data vis, you could use Pentaho BI
Server, SpagoBI, or Saiku Analytics, to name a few.

They are all open source, just not under the Apache umbrella.

(I do write Saiku; this is not an endorsement, just a selection of tools we
use regularly in the open source business intelligence market.)

Regards

Tom
On 24 Oct 2015 11:27, "ganesh" <g4...@gmail.com> wrote:

> Hi,
>
> Regarding Apache Zeppelin: I think it currently does not support Apache
> Drill connectivity by default. I saw the following on a site:
>
> https://wiki.apache.org/incubator/ZeppelinProposal
> Initial Goals The initial goals will be to move the existing codebase to
> Apache and integrate with the Apache development process. This includes
> moving all infrastructure that we currently maintain, such as: a website, a
> mailing list, an issues tracker and a Jenkins CI, as mentioned in “Required
> Resources” section of current proposal. Once this is accomplished, *we plan
> for incremental development and releases that follow the Apache guidelines.
> To increase adoption the major goal for the project would be to provide
> integration with as much projects from Apache data ecosystem as possible,
> including new interpreters for Apache Hive, Apache Drill and adding
> Zeppelin distribution to Apache Bigtop*.
>
> I have not been able to test much with Tableau yet. I didn't get much time
> to spend on it, and whatever I have tried so far hasn't been fruitful with
> Tableau. I will spend some more time on it...
>
>
>
> On Thu, Oct 22, 2015 at 8:16 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
>
> > See if the videos on this page help you.
> >
> > https://www.mapr.com/products/apache-drill
> >
> >
> >
> >
> > —Andries
> >
> >
> > > On Oct 22, 2015, at 7:38 AM, ganesh <g4...@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > Are there any other links with tutorials for using Tableau with Hive?
> > >
> > > I have already gone through the one on the apache-drill site. I am not
> > > able to proceed with those.
> > >
> > > On Thu, Oct 22, 2015 at 8:03 PM, Andries Engelbrecht <
> > > aengelbrecht@maprtech.com> wrote:
> > >
> > >> Hive should be visible and usable in Tableau. You can use Drill views
> > >> for dfs data, or you can use Tableau Custom SQL to access the data.
> > >>
> > >> Make sure to install the Tableau TDC file that comes with the ODBC
> > >> driver.
> > >> https://drill.apache.org/docs/installing-the-tdc-file-on-windows/
> > >> https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/
> > >>
> > >>
> > >> —Andries
> > >>
> > >>
> > >>> On Oct 22, 2015, at 6:40 AM, ganesh <g4...@gmail.com>
> wrote:
> > >>>
> > >>> Hi John,
> > >>>
> > >>> Thanks for suggesting a new name: Apache Zeppelin.
> > >>>
> > >>> I am currently trying the 14-day trial version of Tableau with not
> > >>> much success.
> > >>> Only today did I learn that for files in Hadoop or the local file
> > >>> system, I would need to create a view.
> > >>>
> > >>> Still, though I can see my Hive tables in Tableau, I cannot see any
> > >>> data.
> > >>> Nor am I able to use Tableau from the help links given in apache-drill
> > >>> currently (http://drill.apache.org/docs/tableau-examples/).
> > >>>
> > >>> Snapshot attached, in case you have worked with Tableau.
> > >>>
> > >>> I will look into Zeppelin also.
> > >>>
> > >>>
> > >>> On Thu, Oct 22, 2015 at 6:43 PM, John Omernik <john@omernik.com
> > <mailto:
> > >> john@omernik.com>> wrote:
> > >>> I separated my response from the original topic to keep any responses
> > >> there
> > >>> focused on the design document.
> > >>>
> > >>> As to ways to use Drill, I have been working with SQL Squirrel quite
> > >>> successfully.  On my list of things to do, if you want to stay in the
> > >>> Apache world is looking at Apache Zeppelin.  In the Git Repo, there
> is
> > a
> > >>> Drill Plugin so you can run SQL again Drill, look at results, and do
> > >> basic
> > >>> visualizations, I have been trying to way until the PR is merged on
> > >>> Zeppelin, but for your use case, you may want to grab the plugin code
> > and
> > >>> run with it.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Thu, Oct 22, 2015 at 6:06 AM, ganesh <g4ganeshsemalty@gmail.com
> > >> <ma...@gmail.com>> wrote:
> > >>>
> > >>>> Hello,
> > >>>>
> > >>>> John, you seem to be quite impressed with apache drill .. nice.
> > >>>> I am new to un-structured world and just started 1 week back on
> APACHE
> > >>>> DRILL after suggestion from my collegues. We have a semi structured
> > >> data
> > >>>> where we have constraint that we do not know number of columns
> > >>>>
> > >>>> I heard that APACHE DRILL is column free applicationa nd with
> support
> > >> of
> > >>>> JSON format, it allows to create columns on-fly,
> > >>>> I converted my data from CSV-like-format to JSON and trying to
> figure
> > >> out
> > >>>> if it will work for me.
> > >>>>
> > >>>> Here I hit two issues :-
> > >>>> 1) My column were like : 3100.2.1.2 <tel:3100.2.1.2>  and values
> like
> > >> '-2303" or
> > >>>> '01/01/2015
> > >>>> 02:02:00"
> > >>>>
> > >>>> Challenge was that column cant be started with Numeric value. So I
> had
> > >> to
> > >>>> change key as: "t3100.2.1.2"
> > >>>> After that things were quite OK,
> > >>>>
> > >>>> Now I need some help from you guys. To proceed I have to present my
> > >> work to
> > >>>> management as an example.
> > >>>> But querying on apache drill console, doesnt seem to be an
> attractive
> > >> way
> > >>>> to present things.
> > >>>>
> > >>>> I tried drill explorer too.But didnt find that so good.
> > >>>> One thing to note, I am playing with files on Hadoop standalone mode
> > in
> > >>>> UBUNTU.
> > >>>>
> > >>>> To make it appear more good looking, I started with QLIK SENSE ..
> but
> > >> was
> > >>>> unable to connect it with hadoop file system. It only showed me HIVE
> > >> FILES.
> > >>>> Then I downloaded TABLEAU Trial version ... but I am unable to get
> > >> Hadoop
> > >>>> data here too...
> > >>>>
> > >>>> Please help me how to proceed. I have presentation on coming Monday.
> > >>>> Queries are quite ready .. I just need to show in visualization form
> > >>>> ........ using OPEN SOURCE applications only.
> > >>>>
> > >>>>
> > >>>> Guys please help me.
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <john@omernik.com
> > >> <ma...@omernik.com>> wrote:
> > >>>>
> > >>>>> AWESOME!
> > >>>>>
> > >>>>> I had just been in the process of writing up a long user story to
> > >> ask for
> > >>>>> and support exactly this.   I modified it and included it here:
> > >>>>>
> > >>>>>
> > >>>>> To start out, I want to say how much I love the Drill project, and
> > >> the
> > >>>>> potential it has. I've put this together based on my experiences
> and
> > >> want
> > >>>>> to contribute a perspective as a user, not just put a bunch of
> > >> critiques
> > >>>> in
> > >>>>> an email.  I hope it's all taken in that spirit.  Additional note,
> I
> > >>>> wrote
> > >>>>> this prior to seeing the Design Document share by Hsuan Yi Chu
> > >> yesterday.
> > >>>>> If you are reading it, and think to yourself “that wording is odd…”
> > >>>> please
> > >>>>> consider it from the “I didn’t want to throw away the user story”
> > >>>>> perspective and the “I wrote it before the design doc” perspective.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Additionally, I understand that some of what I am suggesting may
> not
> > >> be
> > >>>>> easy from a development perspective.  I am just being upfront with
> my
> > >>>>> experience, so we can look to determine what can be done; I am not
> > >>>> looking
> > >>>>> for a silver bullet here, just looking for improvement.  Some may
> be
> > >> as
> > >>>>> simple as better documentation, other suggestions may be harder to
> > >>>>> implement.  Either way, I thought a verbose user story might be
> > >> useful to
> > >>>>> the community as a whole.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> John
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> *User Story*
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> As I have been working with Drill for data exploration, I came
> across
> > >>>>> multiple "things" that just were hard.  In dealing with some data,
> > >>>>> especially JSON data, it can be ugly, and scaled ugly is even
> worse!
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> For this story, I am working with a JSON dump from MongoDB, and you
> > >> would
> > >>>>> think it would be well structured, and for the most part it is.
> > >> There
> > >>>> are
> > >>>>> some application level mistakes that were made (I will go into that
> > >> in a
> > >>>>> moment), but in general Drill handles this well.  So with this data
> > >> set,
> > >>>>> there are a few main challenges I am seeing:
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> 1.     When there is a field that has a float, and then a later
> > >> record
> > >>>> has
> > >>>>> the number 0 in it (which Drill takes as a INT). This is a known
> > >> problem
> > >>>>> and one that Drill has a solution for.
> > >>>>>
> > >>>>> 2.     When there is a field is of one type (a map) and then a
> later
> > >>>> record
> > >>>>> has a string in it.  No easy solution here.
> > >>>>>
> > >>>>> 3.     Select * where there is a json field with a . in the name. I
> > >> won’t
> > >>>>> go into details here, but I feel this factors into data
> exploration,
> > >>>>> because it changes the ability to “stay in Drill” to explore their
> > >> data (
> > >>>>> https://issues.apache.org/jira/browse/DRILL-3922 <
> > >> https://issues.apache.org/jira/browse/DRILL-3922>)
> > >>>>>
> > >>>>> 4.     Error reporting challenges
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> With the problem summary laid out, I wanted to walk through my
> > >> process in
> > >>>>> working with this data, and where, if I were a user Drill could
> have
> > >> been
> > >>>>> much more helpful to the process.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Here is a description of the process I went through:
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> 1.     Copy data into filesystem
> > >>>>>
> > >>>>> 2.     Use drill to “Select * from `path_to/dump.json` limit 1
> > >>>>>
> > >>>>> 3.     (I just want to see what it looks like!)
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Here I get this error:
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>> select * from `path_to/ dump.json` limit 1;
> > >>>>>
> > >>>>> Error: DATA_READ ERROR: You tried to write a BigInt type when you
> are
> > >>>> using
> > >>>>> a ValueWriter of type NullableFloat8WriterImpl.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> File  /data/dev/path_to/dump.json
> > >>>>>
> > >>>>> Record  1
> > >>>>>
> > >>>>> Line  1
> > >>>>>
> > >>>>> Column  9054
> > >>>>>
> > >>>>> Field  entropy
> > >>>>>
> > >>>>> Fragment 0:0
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> This isn’t incredibly helpful from a user perspective.  I.e. When I
> > >>>> Google
> > >>>>> around, I realize now that in the docs it talks about “Schema
> > >> Changes”
> > >>>> and
> > >>>>> one possible item is use the setting below. However, examples of
> the
> > >> data
> > >>>>> that was trying to be displayed (with it’s implied type) may help
> > >> users
> > >>>>> grok what is happening.  At least in this case it showed me the
> field
> > >>>> name!
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> This is a great example where since we have known use case (when
> > >> numbers
> > >>>>> are doubles but someone tries to store 0 an INT) it fails, thus
> dev’s
> > >>>> have
> > >>>>> added a setting to allow a user to get through that, that the error
> > >>>> message
> > >>>>> could be more helpful.   In this case, Showing two record numbers
> > >> (line
> > >>>>> numbers) with different types, the field values with their implied
> > >> types,
> > >>>>> and perhaps a suggestion about using the setting to address the
> > >> problem.
> > >>>>> This could make it more intuitive for the user to stay in Drill,
> and
> > >> stay
> > >>>>> in the data.   In this case, I looked at a head of the file, and
> saw
> > >> the
> > >>>>> issue and was able to proceed.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Also, as a corollary here, the user documentation does not show
> this
> > >>>> error
> > >>>>> related to the schema change problem. This would be a great place
> to
> > >>>> state,
> > >>>>> “if you see an error that looks like X, this is what is happening
> and
> > >>>> what
> > >>>>> you can do for it.”
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> *Side node on documentation*
> > >>>>>
> > >>>>> We should look to have documentation try to be role based.   In
> this
> > >>>> case,
> > >>>>> the documentation says use “ALTER SYSTEM” I would argue, and I am
> > >>>> guessing
> > >>>>> others would concur, that for this use case, “ALTER SESSION” may
> be a
> > >>>>> better suggestion as this is specific alteration to address the use
> > >> case
> > >>>> of
> > >>>>> loading/querying a specific data set, and is likely done by a user
> > >> of the
> > >>>>> system.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> If a user is doing self-serve data, then in an enterprise
> > >> environment,
> > >>>> they
> > >>>>> may not have the ability to use ALTER SYSTEM and get an error, thus
> > >> may
> > >>>> be
> > >>>>> confused on how to proceed.   In addition ALTER SYSTEM by a user
> who
> > >>>>> doesn’t understand that they are changing, yet have the rights to
> > >> change,
> > >>>>> may introduce future data problems they didn’t expect.   I like
> that
> > >> the
> > >>>>> default is a more constrictive method, because it makes people be
> > >>>> explicit
> > >>>>> about data, yet the documentation should also aim to be explicit
> > >> about
> > >>>>> something like a system wide change.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> *Back to the story*
> > >>>>>
> > >>>>> Ok so now I will do ALTER SESSION SET on the read_numbers_as_double
> > >>>> setting
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> I run the query again.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>> select * from `path_to/dump.json` limit 1;
> > >>>>>
> > >>>>> Error: DATA_READ ERROR: Error parsing JSON - You tried to write a
> > >> VarChar
> > >>>>> type when you are using a ValueWriter of type SingleMapWriter.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> File  /data/dev/path_to/dump.json
> > >>>>>
> > >>>>> Record  4009
> > >>>>>
> > >>>>> Fragment 0:0
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Another error.  But what does this one mean? Having been living in
> > >>>>> the docs and on the Drill user list, and because it’s similar to the
> > >>>>> schema change issue, I recognized that that is what we are looking at
> > >>>>> here.  Instead of double vs. int, we have one field that is a map most
> > >>>>> of the time, and in some cases is a string.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> But this doesn’t really help me as a user.  To troubleshoot this,
> > >>>>> Drill doesn’t offer any options. This file is 500 MB of dense, nested
> > >>>>> JSON data with 51k records.   My solution? I took the record number and
> > >>>>> went to my NFS-mounted clustered file system (thank goodness I had MapR
> > >>>>> here; I am not sure how I would have done this with POSIX tools
> > >>>>> otherwise)
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> My command: $ head -4009 dump.json | tail -1
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> That (I hoped) showed me the record in question. Note the error from
> > >>>>> Drill didn’t tell me which field was at fault, so I had to visually
> > >>>>> align things to work that out.  However, I was able to spot the
> > >>>>> difference and work with the dev to understand why it happened. I
> > >>>>> removed those records, and things worked correctly.
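
That visual alignment could also be scripted. A minimal sketch, assuming the dump is newline-delimited JSON (one record per line); `field_types` is a hypothetical helper, and the filename and record number are just the ones from the story:

```python
import json

def field_types(path, record_number):
    """Return each top-level field of one record with its JSON type,
    so a map-valued field that suddenly holds a string stands out."""
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if lineno == record_number:
                record = json.loads(line)
                return {key: type(value).__name__ for key, value in record.items()}
    raise ValueError("file has fewer than %d records" % record_number)

# e.g. field_types("dump.json", 4009)
```

Comparing the output for record 4009 against a known-good record shows which field flipped, say, from dict to str.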
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Could there have been a way to identify that within Drill? My
> > >>>>> solution was to write a Python script that read through the dump and
> > >>>>> discarded the records that were not a map. That works on 500 MB, but
> > >>>>> what about 500 GB? I guess a Spark job could clean the data…. But could
> > >>>>> Drill be given some tools to help with this situation?
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> For example, the first thing I said was: What field is at issue?  I
> > >> had
> > >>>> no
> > >>>>> way to see what was up there.  I had to use other tools to see the
> > >> data
> > >>>> so
> > >>>>> I could understand the problem. Then when I understood the problem,
> > >> I had
> > >>>>> to use Python to produce data that was queryable.
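
A cleaning pass like the one described might look like this (a minimal sketch, assuming newline-delimited JSON; the field name `details` is hypothetical, standing in for the field that is a map most of the time):

```python
import json

def keep_map_records(src, dst, field="details"):
    """Stream a JSON-lines dump, writing only records whose `field`
    is a map; streaming keeps memory flat even on a very large file."""
    kept = dropped = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            record = json.loads(line)
            if isinstance(record.get(field), dict):
                fout.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Returning the kept/dropped counts also answers the later question of how common the exception actually is.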
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Given the design document Hsuan Yi Chu just posted to the mailing
> > >>>>> list, at this point my post is just a user story to support it. To
> > >>>>> summarize the points I’d like to see included in the design document
> > >>>>> (from a user perspective, without claiming to understand the “how or
> > >>>>> why”):
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> *1.     **Error messages that are more verbose in explaining the
> > >> problem*
> > >>>>>
> > >>>>> a.     Filename, row number, column number or name
> > >>>>>
> > >>>>> b.     Option to output the “offending row”
> > >>>>>
> > >>>>> c.     Showing the data that is causing the error WITH the type
> Drill
> > >>>>> inferred.
> > >>>>>
> > >>>>> d.     If there are options to help work through dirty data, perhaps
> > >>>>> the error message could include them: “Data was a double, then Drill
> > >>>>> found this value: 0, which was an int, in file x at row 24 in column
> > >>>>> “myfloatingdata”; consider using store.json.read_numbers_as_double to
> > >>>>> address the issue.”
> > >>>>>
> > >>>>> 2.     *A way to determine how common this exception is*
> > >>>>>
> > >>>>> a.     If I am playing with a messy data set, and this error
> happens,
> > >>>> does
> > >>>>> it happen on 1 record? 2? 5000?  Knowing that information would:
> > >>>>>
> > >>>>>                                               i.     Help users
> > >>>> understand
> > >>>>> how Drill is seeing that particular column
> > >>>>>
> > >>>>>                                             ii.     Make decisions
> > >> on
> > >>>>> excluding data rather than just removing it. What if the first 10
> > >> records
> > >>>>> were errors, and then you excluded the remaining 10 million because
> > >> they
> > >>>>> were correct yet different from the first 10?
> > >>>>>
> > >>>>> b.     Perhaps there could be a “stats” function that works only when
> > >>>>> it is the sole selected item, or when everything selected is such a
> > >>>>> stats function?
> > >>>>>
> > >>>>>                                               i.     Select
> > >>>>> type_stats(fieldsname) from data
> > >>>>>
> > >>>>>                                             ii.      (that wouldn’t
> > >>>> error
> > >>>>> on different types)
> > >>>>>
> > >>>>> 3.     *An ability to set a “return null on this field if there is an
> > >>>>> error or it is not castable to type X” behavior, especially in a view,
> > >>>>> perhaps as a function.*
> > >>>>>
> > >>>>> a.     Allow users to avoid reparsing data outside Drill
> > >>>>>
> > >>>>> b.     Load it into a sane format (one-time loads/ETL to clean data)
> > >>>>>
> > >>>>> c.     Not be a system- or session-wide exception.
> > >>>>>
> > >>>>>                                               i.     I think this is
> > >>>>> important because I may have a field where I want it to read the
> > >>>>> numbers as double, but what if I have another field in the same dataset
> > >>>>> where I don’t want the numbers read as double? A SYSTEM- or
> > >>>>> SESSION-level variable takes away that granularity.
> > >>>>>
> > >>>>> d.     Select field1, CASTORNULL(field2, int) as field2,
> > >>>> CASTORNULL(field3,
> > >>>>> double) as field3 from ugly_data.
> > >>>>>
> > >>>>> e.     That’s an example where it’s in the select, but I could see a
> > >>>>> WHERE clause version too:
> > >>>>>
> > >>>>> f.      Select field1, field2, field3 from ugly_data where
> > >>>>> ISTYPE(field2, int) and ISTYPE(field3, double)
> > >>>>>
> > >>>>> 4.     *Updating of the documentation related to ALTER SESSION vs
> > >> ALTER
> > >>>>> SYSTEM with an eye to the context of the majority use case of the
> > >>>>> documented feature*
> > >>>>>
> > >>>>> a.     For data loads, the documentation uses ALTER SYSTEM and
> that’s
> > >>>>> problematic because:
> > >>>>>
> > >>>>>                                               i.     Not all users
> > >> have
> > >>>>> the privileges to issue an ALTER SYSTEM. Thus a new user trying to
> > >> figure
> > >>>>> things out may not realize they can just ALTER SESSION after
> getting
> > >> an
> > >>>>> ALTER SYSTEM error.
> > >>>>>
> > >>>>>                                             ii.     ALTER SYSTEM on
> > >>>>> data-loading settings, especially ones that make Drill’s data
> > >>>>> interpretation more permissive, can lead to unintended consequences
> > >>>>> later. An admin, who may be a good systems admin, helping a data user
> > >>>>> troubleshoot an error may issue an ALTER SYSTEM without realizing this
> > >>>>> changes all future data imports.
> > >>>>>
> > >>>>> b.     Note, I found a few cases, but I would suggest a thorough
> > >> review
> > >>>> of
> > >>>>> the various use cases throughout the documentation, and in areas
> > >> where it
> > >>>>> really could be either, have a small paragraph indicating the
> > >>>> ramifications
> > >>>>> of either command.
> > >>>>>
> > >>>>> *5.     **A Philosophy within the Drill Community to “Stay in
> Drill”
> > >> for
> > >>>>> data exploration*
> > >>>>>
> > >>>>> a.     This is obviously not so much a development thing as a
> > >>>>> mindset. If someone says “I tried to do X, and I got an error” and the
> > >>>>> community’s response is “Look through your data and do Z to it so Drill
> > >>>>> can read it”, then we should reconsider that scenario and try to
> > >>>>> provide an option within Drill to intuitively handle the edge case.
> > >>>>> This is difficult.
> > >>>>>
> > >>>>> b.     There are cases even in the documentation where this is the
> > >>>>> case: https://drill.apache.org/docs/json-data-model/ talks about
> > >>>>> arrays at the root level and about reading some empty arrays.  In
> > >>>>> these cases, we have to leave Drill to fix the problem. This works on
> > >>>>> small data, but may not work on large or wide data. Consider the
> > >>>>> array-at-root-level limitation.  What if some process outside the
> > >>>>> user’s control produces 1000 100 MB JSON files and we want to read
> > >>>>> them? To fix it, we have to rewrite those files. That is a lot of
> > >>>>> work, either manual or automated.
> > >>>>>
> > >>>>> c.     Once again I know this isn’t easy, but we shouldn’t answer
> > >>>> questions
> > >>>>> about how to do something by saying “fix this outside of Drill so
> > >> Drill
> > >>>> can
> > >>>>> read your data” if at all possible.
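
To make points 2b and 3d above concrete, here is a rough Python sketch of the semantics being proposed. Neither `type_stats` nor `CASTORNULL` exists in Drill; these are illustrations only, with `type_stats` reading a newline-delimited JSON file:

```python
import json
from collections import Counter

def type_stats(path, field):
    """Count how often each JSON type appears in one field, so a user
    can see whether a schema change affects 1 record or 5,000."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            value = json.loads(line).get(field)
            counts[type(value).__name__] += 1
    return dict(counts)

def cast_or_null(value, target):
    """Return the value cast to the target type, or None (SQL NULL)
    when the cast fails, instead of failing the whole query."""
    try:
        return target(value)
    except (TypeError, ValueError):
        return None
```

A `type_stats` result like `{"dict": 50999, "str": 3}` would immediately justify excluding the three odd records rather than the fifty thousand good ones.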
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> I hope this story helps support the design document presented.  I am
> > >>>>> happy to participate in more discussion around these topics, as I have
> > >>>>> enjoyed digging into the internals of Drill.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> John Omernik
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> *Name: Ganesh Semalty*
> > >>>> *Location: Gurgaon,Haryana(India)*
> > >>>> *Email Id: g4ganeshsemalty@gmail.com*
> > >>>>
> > >>>>
> > >>>> P
> > >>>>
> > >>>> *Please consider the environment before printing this e-mail - SAVE
> > >> TREE.*
> > >>>>
>

Re: New Topic: Drill Visualization and Use Support

Posted by John Omernik <jo...@omernik.com>.
This is the pull request for Drill Support in Zeppelin.

https://github.com/apache/incubator-zeppelin/pull/110



On Sat, Oct 24, 2015 at 5:27 AM, ganesh <g4...@gmail.com> wrote:

> Hi,
>
> Regarding Apache Zeppelin: I think it currently does not support Apache
> Drill connectivity by default. I saw the following on a site:
>
> https://wiki.apache.org/incubator/ZeppelinProposal
> Initial Goals The initial goals will be to move the existing codebase to
> Apache and integrate with the Apache development process. This includes
> moving all infrastructure that we currently maintain, such as: a website, a
> mailing list, an issues tracker and a Jenkins CI, as mentioned in “Required
> Resources” section of current proposal. Once this is accomplished, *we plan
> for incremental development and releases that follow the Apache guidelines.
> To increase adoption the major goal for the project would be to provide
> integration with as much projects from Apache data ecosystem as possible,
> including new interpreters for Apache Hive, Apache Drill and adding
> Zeppelin distribution to Apache Bigtop*.
>
> Currently I have not been able to test much with Tableau. I didn't get much
> time to spend on it, and whatever I have tried so far hasn't been fruitful.
> I will spend some more time ...
>
>
>
> On Thu, Oct 22, 2015 at 8:16 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
>
> > See if the videos on this page help you.
> >
> > https://www.mapr.com/products/apache-drill
> >
> >
> >
> >
> > —Andries
> >
> >
> > > On Oct 22, 2015, at 7:38 AM, ganesh <g4...@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > Are there any other tutorial links for Tableau v/s Hive?
> > >
> > > I have already gone through the one on the Apache Drill site. I am not
> > > able to proceed with those.
> > >
> > > On Thu, Oct 22, 2015 at 8:03 PM, Andries Engelbrecht <
> > > aengelbrecht@maprtech.com> wrote:
> > >
> > >> Hive should be visible and usable in Tableau. You can use Drill Views
> > >> for dfs data, or you can use Tableau Custom SQL to access the data.
> > >>
> > >> Make sure to install the Tableau TDC file that comes with the ODBC
> > driver.
> > >> https://drill.apache.org/docs/installing-the-tdc-file-on-windows/
> > >> https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/
> > >>
> > >>
> > >> —Andries
> > >>
> > >>
> > >>> On Oct 22, 2015, at 6:40 AM, ganesh <g4...@gmail.com>
> wrote:
> > >>>
> > >>> Hi John,
> > >>>
> > >>> Thanks for suggesting the new name: Apache Zeppelin.
> > >>>
> > >>> I am currently trying the 14-day trial version of Tableau with not much
> > >>> success. Only today did I learn that for files in Hadoop or the local
> > >>> file system, I would need to create a view.
> > >>>
> > >>> Still, though I can see my tables from Hive in Tableau, I cannot see any
> > >>> data. Nor am I able to use Tableau from the help links currently given
> > >>> in Apache Drill (http://drill.apache.org/docs/tableau-examples/).
> > >>>
> > >>> Snapshot attached, in case you have worked with Tableau.
> > >>>
> > >>> I will look into Zeppelin also.
> > >>>
> > >>> On Thu, Oct 22, 2015 at 6:43 PM, John Omernik <john@omernik.com> wrote:
> > >>> I separated my response from the original topic to keep any responses
> > >> there
> > >>> focused on the design document.
> > >>>
> > >>> As to ways to use Drill, I have been working with SQL Squirrel quite
> > >>> successfully.  One item on my list of things to do, if you want to stay
> > >>> in the Apache world, is looking at Apache Zeppelin.  In the Git repo,
> > >>> there is a Drill plugin so you can run SQL against Drill, look at
> > >>> results, and do basic visualizations. I have been waiting until the PR
> > >>> is merged into Zeppelin, but for your use case, you may want to grab the
> > >>> plugin code and run with it.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Thu, Oct 22, 2015 at 6:06 AM, ganesh <g4ganeshsemalty@gmail.com> wrote:
> > >>>
> > >>>> Hello,
> > >>>>
> > >>>> John, you seem to be quite impressed with Apache Drill .. nice.
> > >>>> I am new to the unstructured-data world and started just one week back
> > >>>> on Apache Drill after a suggestion from my colleagues. We have
> > >>>> semi-structured data with the constraint that we do not know the number
> > >>>> of columns.
> > >>>>
> > >>>> I heard that Apache Drill is a schema-free application and, with
> > >>>> support for the JSON format, it allows creating columns on the fly.
> > >>>> I converted my data from a CSV-like format to JSON and am trying to
> > >>>> figure out if it will work for me.
> > >>>>
> > >>>> Here I hit two issues:
> > >>>> 1) My columns were named like 3100.2.1.2 and had values like "-2303" or
> > >>>> "01/01/2015 02:02:00".
> > >>>>
> > >>>> The challenge was that a column name can't start with a numeric value,
> > >>>> so I had to change the key to "t3100.2.1.2".
> > >>>> After that things were quite OK.
> > >>>>
> > >>>> Now I need some help from you guys. To proceed, I have to present my
> > >>>> work to management as an example.
> > >>>> But querying on the Apache Drill console doesn't seem an attractive way
> > >>>> to present things.
> > >>>>
> > >>>> I tried Drill Explorer too, but didn't find it that good.
> > >>>> One thing to note: I am playing with files on Hadoop in standalone mode
> > >>>> on Ubuntu.
> > >>>>
> > >>>> To make it look better, I started with Qlik Sense, but was unable to
> > >>>> connect it with the Hadoop file system. It only showed me Hive files.
> > >>>> Then I downloaded the Tableau trial version, but I am unable to get
> > >>>> Hadoop data there too...
> > >>>>
> > >>>> Please help me with how to proceed. I have a presentation this coming
> > >>>> Monday. The queries are quite ready; I just need to show them in
> > >>>> visualization form ........ using open source applications only.
> > >>>>
> > >>>>
> > >>>> Guys please help me.
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <john@omernik.com
> > >> <ma...@omernik.com>> wrote:
> > >>>>
> > >>>>> AWESOME!
> > >>>>>
> > >>>>> I had just been in the process of writing up a long user story to
> > >> ask for
> > >>>>> and support exactly this.   I modified it and included it here:
> > >>>>>
> > >>>>>
> > >>>>> To start out, I want to say how much I love the Drill project, and
> > >> the
> > >>>>> potential it has. I've put this together based on my experiences
> and
> > >> want
> > >>>>> to contribute a perspective as a user, not just put a bunch of
> > >> critiques
> > >>>> in
> > >>>>> an email.  I hope it's all taken in that spirit.  Additional note,
> I
> > >>>> wrote
> > >>>>> this prior to seeing the Design Document share by Hsuan Yi Chu
> > >> yesterday.
> > >>>>> If you are reading it, and think to yourself “that wording is odd…”
> > >>>> please
> > >>>>> consider it from the “I didn’t want to throw away the user story”
> > >>>>> perspective and the “I wrote it before the design doc” perspective.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Additionally, I understand that some of what I am suggesting may
> not
> > >> be
> > >>>>> easy from a development perspective.  I am just being upfront with
> my
> > >>>>> experience, so we can look to determine what can be done; I am not
> > >>>> looking
> > >>>>> for a silver bullet here, just looking for improvement.  Some may
> be
> > >> as
> > >>>>> simple as better documentation, other suggestions may be harder to
> > >>>>> implement.  Either way, I thought a verbose user story might be
> > >> useful to
> > >>>>> the community as a whole.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> John
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> *User Story*
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> As I have been working with Drill for data exploration, I came
> across
> > >>>>> multiple "things" that just were hard.  In dealing with some data,
> > >>>>> especially JSON data, it can be ugly, and scaled ugly is even
> worse!
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> For this story, I am working with a JSON dump from MongoDB, and you
> > >> would
> > >>>>> think it would be well structured, and for the most part it is.
> > >> There
> > >>>> are
> > >>>>> some application level mistakes that were made (I will go into that
> > >> in a
> > >>>>> moment), but in general Drill handles this well.  So with this data
> > >> set,
> > >>>>> there are a few main challenges I am seeing:
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> 1.     When there is a field that has a float, and then a later
> > >> record
> > >>>> has
> > >>>>> the number 0 in it (which Drill takes as a INT). This is a known
> > >> problem
> > >>>>> and one that Drill has a solution for.
> > >>>>>
> > >>>>> 2.     When there is a field is of one type (a map) and then a
> later
> > >>>> record
> > >>>>> has a string in it.  No easy solution here.
> > >>>>>
> > >>>>> 3.     Select * where there is a json field with a . in the name. I
> > >> won’t
> > >>>>> go into details here, but I feel this factors into data
> exploration,
> > >>>>> because it changes the ability to “stay in Drill” to explore their
> > >> data (
> > >>>>> https://issues.apache.org/jira/browse/DRILL-3922 <
> > >> https://issues.apache.org/jira/browse/DRILL-3922>)
> > >>>>>
> > >>>>> 4.     Error reporting challenges
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> With the problem summary laid out, I wanted to walk through my
> > >> process in
> > >>>>> working with this data, and where, if I were a user Drill could
> have
> > >> been
> > >>>>> much more helpful to the process.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Here is a description of the process I went through:
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> 1.     Copy data into filesystem
> > >>>>>
> > >>>>> 2.     Use drill to “Select * from `path_to/dump.json` limit 1
> > >>>>>
> > >>>>> 3.     (I just want to see what it looks like!)
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Here I get this error:
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>> select * from `path_to/ dump.json` limit 1;
> > >>>>>
> > >>>>> Error: DATA_READ ERROR: You tried to write a BigInt type when you
> are
> > >>>> using
> > >>>>> a ValueWriter of type NullableFloat8WriterImpl.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> File  /data/dev/path_to/dump.json
> > >>>>>
> > >>>>> Record  1
> > >>>>>
> > >>>>> Line  1
> > >>>>>
> > >>>>> Column  9054
> > >>>>>
> > >>>>> Field  entropy
> > >>>>>
> > >>>>> Fragment 0:0
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> This isn’t incredibly helpful from a user perspective.  I.e. When I
> > >>>> Google
> > >>>>> around, I realize now that in the docs it talks about “Schema
> > >> Changes”
> > >>>> and
> > >>>>> one possible item is use the setting below. However, examples of
> the
> > >> data
> > >>>>> that was trying to be displayed (with it’s implied type) may help
> > >> users
> > >>>>> grok what is happening.  At least in this case it showed me the
> field
> > >>>> name!
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> This is a great example where since we have known use case (when
> > >> numbers
> > >>>>> are doubles but someone tries to store 0 an INT) it fails, thus
> dev’s
> > >>>> have
> > >>>>> added a setting to allow a user to get through that, that the error
> > >>>> message
> > >>>>> could be more helpful.   In this case, Showing two record numbers
> > >> (line
> > >>>>> numbers) with different types, the field values with their implied
> > >> types,
> > >>>>> and perhaps a suggestion about using the setting to address the
> > >> problem.
> > >>>>> This could make it more intuitive for the user to stay in Drill,
> and
> > >> stay
> > >>>>> in the data.   In this case, I looked at a head of the file, and
> saw
> > >> the
> > >>>>> issue and was able to proceed.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Also, as a corollary here, the user documentation does not show
> this
> > >>>> error
> > >>>>> related to the schema change problem. This would be a great place
> to
> > >>>> state,
> > >>>>> “if you see an error that looks like X, this is what is happening
> and
> > >>>> what
> > >>>>> you can do for it.”
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> *Side node on documentation*
> > >>>>>
> > >>>>> We should look to have documentation try to be role based.   In
> this
> > >>>> case,
> > >>>>> the documentation says use “ALTER SYSTEM” I would argue, and I am
> > >>>> guessing
> > >>>>> others would concur, that for this use case, “ALTER SESSION” may
> be a
> > >>>>> better suggestion as this is specific alteration to address the use
> > >> case
> > >>>> of
> > >>>>> loading/querying a specific data set, and is likely done by a user
> > >> of the
> > >>>>> system.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> If a user is doing self-serve data, then in an enterprise
> > >> environment,
> > >>>> they
> > >>>>> may not have the ability to use ALTER SYSTEM and get an error, thus
> > >> may
> > >>>> be
> > >>>>> confused on how to proceed.   In addition ALTER SYSTEM by a user
> who
> > >>>>> doesn’t understand that they are changing, yet have the rights to
> > >> change,
> > >>>>> may introduce future data problems they didn’t expect.   I like
> that
> > >> the
> > >>>>> default is a more constrictive method, because it makes people be
> > >>>> explicit
> > >>>>> about data, yet the documentation should also aim to be explicit
> > >> about
> > >>>>> something like a system wide change.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> *Back to the story*
> > >>>>>
> > >>>>> Ok so now I will do ALTER SESSION SET on the read_numbers_as_double
> > >>>> setting
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> I run the query again.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>> select * from `path_to/dump.json` limit 1;
> > >>>>>
> > >>>>> Error: DATA_READ ERROR: Error parsing JSON - You tried to write a
> > >> VarChar
> > >>>>> type when you are using a ValueWriter of type SingleMapWriter.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> File  /data/dev/path_to/dump.json
> > >>>>>
> > >>>>> Record  4009
> > >>>>>
> > >>>>> Fragment 0:0
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Another error   But what does this one mean? Ok, now that I have
> been
> > >>>>> living in the docs and in the Drill user list, and because it’s
> > >> similar
> > >>>> to
> > >>>>> the schema change issue, that that is what we are looking at here.
> > >>>> Instead
> > >>>>> of double to int, we have one field that is map most of the time,
> > >> and in
> > >>>>> some cases it’s a string.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> But this doesn’t really help me as a user.  To troubleshoot this
> > >> Drill
> > >>>>> doesn’t offer any options. This file is 500 MB of dense and nested
> > >> JSON
> > >>>>> data with 51k records.   My solution? I took the record number,
> then
> > >> I
> > >>>> went
> > >>>>> to my NFS mounted clustered file system (thank goodness I had MapR
> > >> here,
> > >>>> I
> > >>>>> am not sure how I would have done this with Posix tools)
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> My command: $ head -4009 dump.json|tail -1
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> That (I hoped) showed me the record in question, note the error
> from
> > >>>> Drill
> > >>>>> didn’t tell me which field was at fault here, so I had to visually
> > >> align
> > >>>>> things to address that.  However, I was able to spot the difference
> > >> and
> > >>>>> work with the dev to understand why that happened. I removed those
> > >>>> records,
> > >>>>> and things worked correctly.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Could there have been a way to identify that within drill? My
> > >> solution
> > >>>> was
> > >>>>> to take a python script and read through, and discard those records
> > >> that
> > >>>>> were not a map, however, on 500MB that can work, but what about 500
> > >> GB?
> > >>>> I
> > >>>>> guess a Spark job could clean the data…. But could Drill be given
> > >> some
> > >>>>> tools to help with this situation?
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> For example, the first thing I said was: What field is at issue?  I
> > >> had
> > >>>> no
> > >>>>> way to see what was up there.  I had to use other tools to see the
> > >> data
> > >>>> so
> > >>>>> I could understand the problem. Then when I understood the problem,
> > >> I had
> > >>>>> to use Python to produce data that was queryable.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Based on the design document Hsuan Yi Chu just posted to the
> mailing
> > >>>> list,
> > >>>>> at this point my post is just a user story to support the design
> > >>>> document.
> > >>>>> To summarize the points I’d like to see included in the design
> > >> document
> > >>>>> (from a user perspective), not understanding “how or why”:
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> *1.     **Error messages that are more verbose in explaining the
> > >> problem*
> > >>>>>
> > >>>>> a.     Filename, row number, column number or name
> > >>>>>
> > >>>>> b.     Option to output the “offending row”
> > >>>>>
> > >>>>> c.     Showing the data that is causing the error WITH the type
> Drill
> > >>>>> inferred.
> > >>>>>
> > >>>>> d.     If there are options to help work through dirty data,
> perhaps
> > >> the
> > >>>>> error message could include those: “Data was an double, then drill
> > >> found
> > >>>>> this data: 0 that was a int in File x, at row 24 in column
> > >>>> “myfloatingdata”
> > >>>>> consider using store.json.read_numbers_as_double to address the
> > >> issue.
> > >>>>>
> > >>>>> 2.     *A way to determine how common this exception is*
> > >>>>>
> > >>>>> a.     If I am playing with a messy data set, and this error
> happens,
> > >>>> does
> > >>>>> it happen on 1 record? 2? 5000?  Knowing that information would:
> > >>>>>
> > >>>>>                                               i.     Help users
> > >>>> understand
> > >>>>> how Drill is seeing that particular column
> > >>>>>
> > >>>>>                                             ii.     Make decisions
> > >> on
> > >>>>> excluding data rather than just removing it. What if the first 10
> > >> records
> > >>>>> were errors, and then you excluded the remaining 10 million because
> > >> they
> > >>>>> were correct yet different from the first 10?
> > >>>>>
> > >>>>> b.     Perhaps there could be a “stats” function that only works if
> > >> it’s
> > >>>>> the only selected item or if the select is all those functions
> (stats
> > >>>>> functions)?
> > >>>>>
> > >>>>>                                               i.     Select
> > >>>>> type_stats(fieldsname) from data
> > >>>>>
> > >>>>>                                             ii.      (that wouldn’t
> > >>>> error
> > >>>>> on different types)
> > >>>>>
> > >>>>> 3.     *An ability to set a “return null on this field if error or
> > >> if non
> > >>>>> castable to X type, especially in a view, perhaps in a function.*
> > >>>>>
> > >>>>> a.     Allow them to not have to reparse data outside drill
> > >>>>>
> > >>>>> b.     Load it into a sane format (one time loads/ETL to clean
> data)
> > >>>>>
> > >>>>> c.     Not be system or session wide exception.
> > >>>>>
> > >>>>>                                               i.     I think this
> is
> > >>>>> important because I may have a field where I want it to read the
> > >> numbers
> > >>>> as
> > >>>>> double, but what if I have another field in the same dataset where
> I
> > >>>> don’t
> > >>>>> want it to read the numbers as double? A SYSTEM or SESSION level
> > >> variable
> > >>>>> takes away that granularity
> > >>>>>
> > >>>>> d.     Select field1, CASTORNULL(field2, int) as field2,
> > >>>> CASTORNULL(field3,
> > >>>>> double) as field3 from ugly_data.
> > >>>>>
> > >>>>> e.     That’s an example when it’s in the select, but I Could see a
> > >> where
> > >>>>> clause
> > >>>>>
> > >>>>> f.      Select field1, field2, field3 from ugly data where
> > >> ISTYPE(field2,
> > >>>>> int) and ISTYPE(field3, double)
> > >>>>>
> > >>>>> 4.     *Updating of the documentation related to ALTER SESSION vs
> > >> ALTER
> > >>>>> SYSTEM with an eye to the context of the majority use case of the
> > >>>>> documented feature*
> > >>>>>
> > >>>>> a.     For data loads, the documentation uses ALTER SYSTEM and
> that’s
> > >>>>> problematic because:
> > >>>>>
> > >>>>>                                               i.     Not all users
> > >> have
> > >>>>> the privileges to issue an ALTER SYSTEM. Thus a new user trying to
> > >> figure
> > >>>>> things out may not realize they can just ALTER SESSION after
> getting
> > >> an
> > >>>>> ALTER SYSTEM error.
> > >>>>>
> > >>>>>                                             ii.     ALTER SYSTEM on
> > >> data
> > >>>>> loading items, especially in areas that make Drill’s data
> > >> interpretation
> > >>>>> more permissive can lead to unintended consequences later. An
> admin,
> > >> who
> > >>>>> may be a good systems admin, and helps a data user troubleshoot and
> > >> error
> > >>>>> may issue an ALTER SYSTEM not realizing this changes all future
> data
> > >>>>> imports.
> > >>>>>
> > >>>>> b.     Note, I found a few cases, but I would suggest a thorough
> > >> review
> > >>>> of
> > >>>>> the various use cases throughout the documentation, and in areas
> > >> where it
> > >>>>> really could be either, have a small paragraph indicating the
> > >>>> ramifications
> > >>>>> of either command.
> > >>>>>
> > >>>>> *5.     **A Philosophy within the Drill Community to “Stay in
> Drill”
> > >> for
> > >>>>> data exploration*
> > >>>>>
> > >>>>> a.     This is obviously not as much of a development thing as a
> > >>>>> mindset. If someone says “I tried to do X, and I got an error” and
> > >>>>> the community’s response is “look through your data and do Z to it
> > >>>>> so Drill can read it”, then we should reconsider that scenario and
> > >>>>> try to provide an option within Drill to intuitively handle the
> > >>>>> edge case.  This is difficult.
> > >>>>>
> > >>>>> b.     There are cases even in the documentation where this is the
> > >>>>> case: https://drill.apache.org/docs/json-data-model/ <
> > >>>>> https://drill.apache.org/docs/json-data-model/> talking about arrays
> > >>>>> at the root level or reading some empty arrays.  In these cases, we
> > >>>>> have to leave Drill to fix the problem. That works on small data,
> > >>>>> but may not work on large or wide data. Consider the
> > >>>>> array-at-root-level limitation: what if some process out of the
> > >>>>> user’s control produces 1000 100 MB JSON files and we want to read
> > >>>>> them? To fix it, we have to address those files. Lots of work there,
> > >>>>> either manual or automated.
> > >>>>>
> > >>>>> c.     Once again, I know this isn’t easy, but if at all possible we
> > >>>>> shouldn’t answer questions about how to do something by saying “fix
> > >>>>> this outside of Drill so Drill can read your data”.
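[Editor's note: as a concrete illustration of point 5.b, the usual out-of-Drill fix for a root-level JSON array is to rewrite each file as newline-delimited JSON, one object per line. A minimal, hypothetical Python sketch — the function names and in-memory approach are illustrative, not anything shipped with Drill:]

```python
import json
import os


def array_file_to_ndjson(src_path, dst_path):
    """Rewrite a JSON file whose root is an array ([{...}, {...}]) as
    newline-delimited JSON, one object per line, which Drill can scan.

    json.load() pulls the whole file into memory, so this is fine for
    the ~100 MB files described above; much larger inputs would need a
    streaming parser instead.
    """
    with open(src_path) as src:
        records = json.load(src)  # root-level array -> Python list
    with open(dst_path, "w") as dst:
        for rec in records:
            dst.write(json.dumps(rec) + "\n")


def convert_directory(src_dir, dst_dir):
    """Apply the rewrite to every .json dump in a directory."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        if name.endswith(".json"):
            array_file_to_ndjson(os.path.join(src_dir, name),
                                 os.path.join(dst_dir, name))
```

Automating the fix this way still means touching every file — which is exactly the "lots of work" point 5.b complains about.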
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> I hope this story helps support the design document presented.  I am
> > >>>>> happy to participate in more discussion around these topics, as I
> > >>>>> have enjoyed digging into the internals of Drill.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> John Omernik
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> *Name: Ganesh Semalty*
> > >>>> *Location: Gurgaon,Haryana(India)*
> > >>>> *Email Id: g4ganeshsemalty@gmail.com <mailto:
> > g4ganeshsemalty@gmail.com>
> > >> <g4ganeshsemalty@gmail.com <ma...@gmail.com>>*
> > >>>>
> > >>>>
> > >>>> P
> > >>>>
> > >>>> *Please consider the environment before printing this e-mail - SAVE
> > >> TREE.*
> > >>>>
>

Re: New Topic: Drill Visualization and Use Support

Posted by ganesh <g4...@gmail.com>.
Hi,

Regarding Apache Zeppelin: I think it currently does not support Apache
Drill connectivity by default. I saw the following on this page:

https://wiki.apache.org/incubator/ZeppelinProposal
Initial Goals: The initial goals will be to move the existing codebase to
Apache and integrate with the Apache development process. This includes
moving all infrastructure that we currently maintain, such as: a website, a
mailing list, an issues tracker and a Jenkins CI, as mentioned in “Required
Resources” section of current proposal. Once this is accomplished, *we plan
for incremental development and releases that follow the Apache guidelines.
To increase adoption the major goal for the project would be to provide
integration with as much projects from Apache data ecosystem as possible,
including new interpreters for Apache Hive, Apache Drill and adding
Zeppelin distribution to Apache Bigtop*.

I have not been able to test much with Tableau yet; I didn't get much time
to spend on it, and nothing I have tried so far has been fruitful. I will
spend some more time on it.



On Thu, Oct 22, 2015 at 8:16 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> See if the videos on this page helps you.
>
> https://www.mapr.com/products/apache-drill <
> https://www.mapr.com/products/apache-drill>
>
>
>
>
> —Andries
>
>
> > On Oct 22, 2015, at 7:38 AM, ganesh <g4...@gmail.com> wrote:
> >
> > Hello,
> >
> > Are there any other tutorial links for Tableau vs. Hive?
> >
> > I have already gone through the one on the apache-drill site. I am not
> > able to proceed with those.
> >
> > On Thu, Oct 22, 2015 at 8:03 PM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> >> Hive should be visible and usable in Tableau. You can use Drill views
> >> for dfs data, or you can use Tableau Custom SQL to access the data.
> >>
> >> Make sure to install the Tableau TDC file that comes with the ODBC
> driver.
> >> https://drill.apache.org/docs/installing-the-tdc-file-on-windows/ <
> >> https://drill.apache.org/docs/installing-the-tdc-file-on-windows/>
> >>
> https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/ <
> >>
> https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/>
> >>
> >>
> >> —Andries
> >>
> >>
> >>> On Oct 22, 2015, at 6:40 AM, ganesh <g4...@gmail.com> wrote:
> >>>
> >>> Hi John,
> >>>
> >>> Thanks for suggesting Apache Zeppelin.
> >>>
> >>> I am currently trying the 14-day trial version of Tableau, with not
> >>> much success. Only today did I learn that for files in Hadoop or the
> >>> local file system, I would need to create a view.
> >>>
> >>> Still, though I can see my tables from Hive in Tableau, I cannot see
> >>> any data. Nor am I able to use Tableau from the help links given in
> >>> apache-drill currently (http://drill.apache.org/docs/tableau-examples/ <
> >>> http://drill.apache.org/docs/tableau-examples/>)
> >>>
> >>> Snapshot attached, in case you have worked with Tableau.
> >>>
> >>> I will look into Zeppelin also.
> >>>
> >>>


-- 
*Name: Ganesh Semalty*
*Location: Gurgaon,Haryana(India)*
*Email Id: g4ganeshsemalty@gmail.com <g4...@gmail.com>*


P

*Please consider the environment before printing this e-mail - SAVE TREE.*

Re: New Topic: Drill Visualization and Use Support

Posted by Andries Engelbrecht <ae...@maprtech.com>.
See if the videos on this page help you.

https://www.mapr.com/products/apache-drill




—Andries


> On Oct 22, 2015, at 7:38 AM, ganesh <g4...@gmail.com> wrote:
> 
> Hello,
> 
> Are there any other links as tutorial for TABLEAU v/s HIVE
> 
> I have already gone through the one in apache-drill site. I am not able to
> proceed with those.
> 
> On Thu, Oct 22, 2015 at 8:03 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> 
>> Hive should be visible and usable in Tableau. You can use Drill Views for
>> dfs data, or you can Tableau Custom SQL to access the data.
>> 
>> Make sure to install the Tableau TDC file that comes with the ODBC driver.
>> https://drill.apache.org/docs/installing-the-tdc-file-on-windows/ <
>> https://drill.apache.org/docs/installing-the-tdc-file-on-windows/>
>> https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/ <
>> https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/>
>> 
>> 
>> —Andries
>> 
>> 
>>> On Oct 22, 2015, at 6:40 AM, ganesh <g4...@gmail.com> wrote:
>>> 
>>> Hi John,
>>> 
>>> Thanks for suggesting new name:  Apache Zeppelin
>>> 
>>> I was currently trying 14 Days trial version of TABLEAU with not much
>> success.
>>> Today only I knew that for files in Hadoop or local file system, I would
>> need to create view.
>>> 
>>> Still, though I can see my Tables from HIVE in Tableau, I cannot see any
>> data.
>>> Nor I am able to use TABLEAU from the links help given in apache-drill
>> currently (http://drill.apache.org/docs/tableau-examples/ <
>> http://drill.apache.org/docs/tableau-examples/>)
>>> 
>>> Snapshot attached, Incase you have worked over TABLEAU
>>> 
>>> I will look into Zeppelin also.
>>> 
>>> 
>>> On Thu, Oct 22, 2015 at 6:43 PM, John Omernik <john@omernik.com <mailto:
>> john@omernik.com>> wrote:
>>> I separated my response from the original topic to keep any responses
>> there
>>> focused on the design document.
>>> 
>>> As to ways to use Drill, I have been working with SQL Squirrel quite
>>> successfully.  On my list of things to do, if you want to stay in the
>>> Apache world is looking at Apache Zeppelin.  In the Git Repo, there is a
>>> Drill Plugin so you can run SQL again Drill, look at results, and do
>> basic
>>> visualizations, I have been trying to way until the PR is merged on
>>> Zeppelin, but for your use case, you may want to grab the plugin code and
>>> run with it.
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Oct 22, 2015 at 6:06 AM, ganesh <g4ganeshsemalty@gmail.com
>> <ma...@gmail.com>> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> John, you seem to be quite impressed with apache drill .. nice.
>>>> I am new to un-structured world and just started 1 week back on APACHE
>>>> DRILL after suggestion from my collegues. We have a semi structured
>> data
>>>> where we have constraint that we do not know number of columns
>>>> 
>>>> I heard that APACHE DRILL is column free applicationa nd with support
>> of
>>>> JSON format, it allows to create columns on-fly,
>>>> I converted my data from CSV-like-format to JSON and trying to figure
>> out
>>>> if it will work for me.
>>>> 
>>>> Here I hit two issues :-
>>>> 1) My column were like : 3100.2.1.2 <tel:3100.2.1.2>  and values like
>> '-2303" or
>>>> '01/01/2015
>>>> 02:02:00"
>>>> 
>>>> Challenge was that column cant be started with Numeric value. So I had
>> to
>>>> change key as: "t3100.2.1.2"
>>>> After that things were quite OK,
>>>> 
>>>> Now I need some help from you guys. To proceed I have to present my
>> work to
>>>> management as an example.
>>>> But querying on apache drill console, doesnt seem to be an attractive
>> way
>>>> to present things.
>>>> 
>>>> I tried drill explorer too.But didnt find that so good.
>>>> One thing to note, I am playing with files on Hadoop standalone mode in
>>>> UBUNTU.
>>>> 
>>>> To make it appear more good looking, I started with QLIK SENSE .. but
>> was
>>>> unable to connect it with hadoop file system. It only showed me HIVE
>> FILES.
>>>> Then I downloaded TABLEAU Trial version ... but I am unable to get
>> Hadoop
>>>> data here too...
>>>> 
>>>> Please help me how to proceed. I have presentation on coming Monday.
>>>> Queries are quite ready .. I just need to show in visualization form
>>>> ........ using OPEN SOURCE applications only.
>>>> 
>>>> 
>>>> Guys please help me.
>>>> 
>>>> 
>>>> 
>>>> On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <john@omernik.com
>> <ma...@omernik.com>> wrote:
>>>> 
>>>>> AWESOME!
>>>>> 
>>>>> I had just been in the process of writing up a long user story to
>> ask for
>>>>> and support exactly this.   I modified it and included it here:
>>>>> 
>>>>> 
>>>>> To start out, I want to say how much I love the Drill project, and
>> the
>>>>> potential it has. I've put this together based on my experiences and
>> want
>>>>> to contribute a perspective as a user, not just put a bunch of
>> critiques
>>>> in
>>>>> an email.  I hope it's all taken in that spirit.  Additional note: I
>>>>> wrote this prior to seeing the design document shared by Hsuan Yi Chu
>>>>> yesterday.
>>>>> If you are reading it, and think to yourself “that wording is odd…”
>>>> please
>>>>> consider it from the “I didn’t want to throw away the user story”
>>>>> perspective and the “I wrote it before the design doc” perspective.
>>>>> 
>>>>> 
>>>>> 
>>>>> Additionally, I understand that some of what I am suggesting may not
>> be
>>>>> easy from a development perspective.  I am just being upfront with my
>>>>> experience, so we can look to determine what can be done; I am not
>>>> looking
>>>>> for a silver bullet here, just looking for improvement.  Some may be
>> as
>>>>> simple as better documentation, other suggestions may be harder to
>>>>> implement.  Either way, I thought a verbose user story might be
>> useful to
>>>>> the community as a whole.
>>>>> 
>>>>> 
>>>>> 
>>>>> John
>>>>> 
>>>>> 
>>>>> 
>>>>> *User Story*
>>>>> 
>>>>> 
>>>>> 
>>>>> As I have been working with Drill for data exploration, I came across
>>>>> multiple "things" that just were hard.  In dealing with some data,
>>>>> especially JSON data, it can be ugly, and scaled ugly is even worse!
>>>>> 
>>>>> 
>>>>> 
>>>>> For this story, I am working with a JSON dump from MongoDB, and you
>> would
>>>>> think it would be well structured, and for the most part it is.
>> There
>>>> are
>>>>> some application level mistakes that were made (I will go into that
>> in a
>>>>> moment), but in general Drill handles this well.  So with this data
>> set,
>>>>> there are a few main challenges I am seeing:
>>>>> 
>>>>> 
>>>>> 
>>>>> 1.     When there is a field that has a float, and then a later record
>>>>> has the number 0 in it (which Drill takes as an INT). This is a known
>>>>> problem and one that Drill has a solution for.
>>>>> 
>>>>> 2.     When a field is of one type (a map) and then a later record
>>>>> has a string in it.  No easy solution here.
>>>>> 
>>>>> 3.     Select * where there is a JSON field with a . in the name. I
>>>>> won't go into details here, but I feel this factors into data
>>>>> exploration, because it changes the user's ability to "stay in Drill"
>>>>> to explore their data
>>>>> (https://issues.apache.org/jira/browse/DRILL-3922)
>>>>> 
>>>>> 4.     Error reporting challenges
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> With the problem summary laid out, I wanted to walk through my
>> process in
>>>>> working with this data, and where, if I were a user Drill could have
>> been
>>>>> much more helpful to the process.
>>>>> 
>>>>> 
>>>>> 
>>>>> Here is a description of the process I went through:
>>>>> 
>>>>> 
>>>>> 
>>>>> 1.     Copy data into filesystem
>>>>> 
>>>>> 2.     Use Drill to “Select * from `path_to/dump.json` limit 1”
>>>>> 
>>>>> 3.     (I just want to see what it looks like!)
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Here I get this error:
>>>>> 
>>>>> 
>>>>> 
>>>>>> select * from `path_to/dump.json` limit 1;
>>>>> 
>>>>> Error: DATA_READ ERROR: You tried to write a BigInt type when you are
>>>> using
>>>>> a ValueWriter of type NullableFloat8WriterImpl.
>>>>> 
>>>>> 
>>>>> 
>>>>> File  /data/dev/path_to/dump.json
>>>>> 
>>>>> Record  1
>>>>> 
>>>>> Line  1
>>>>> 
>>>>> Column  9054
>>>>> 
>>>>> Field  entropy
>>>>> 
>>>>> Fragment 0:0
>>>>> 
>>>>> 
>>>>> 
>>>>> This isn’t incredibly helpful from a user perspective.  When I Google
>>>>> around, I realize now that the docs talk about “Schema Changes” and
>>>>> that one possible remedy is the setting below. However, examples of
>>>>> the data that was trying to be displayed (with its implied type) may
>>>>> help users grok what is happening.  At least in this case it showed me
>>>>> the field name!
>>>>> 
>>>>> 
>>>>> 
>>>>> ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
>>>>> 
>>>>> 
>>>>> 
>>>>> This is a great example of a known use case (numbers are doubles, but
>>>>> someone stores a 0 as an INT) where the query fails, and although the
>>>>> devs have added a setting to let a user get through it, the error
>>>>> message could be more helpful.   In this case, showing two record
>>>>> numbers (line numbers) with different types, the field values with
>>>>> their implied types, and perhaps a suggestion about using the setting
>>>>> to address the problem could make it more intuitive for the user to
>>>>> stay in Drill, and stay in the data.   In this case, I looked at the
>>>>> head of the file, saw the issue, and was able to proceed.
>>>>> 
>>>>> 
>>>>> 
>>>>> Also, as a corollary here, the user documentation does not show this
>>>> error
>>>>> related to the schema change problem. This would be a great place to
>>>> state,
>>>>> “if you see an error that looks like X, this is what is happening and
>>>> what
>>>>> you can do for it.”
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> *Side note on documentation*
>>>>> 
>>>>> We should look to have the documentation be role based.   In this
>>>>> case, the documentation says to use “ALTER SYSTEM”. I would argue, and
>>>>> I am guessing others would concur, that for this use case “ALTER
>>>>> SESSION” may be a better suggestion, as this is a specific alteration
>>>>> to address the use case of loading/querying a specific data set, and
>>>>> is likely done by a user of the system.
>>>>> 
>>>>> 
>>>>> 
>>>>> If a user is doing self-serve data exploration in an enterprise
>>>>> environment, they may not have the ability to use ALTER SYSTEM, get an
>>>>> error, and thus be confused about how to proceed.   In addition, an
>>>>> ALTER SYSTEM by a user who doesn't understand what they are changing,
>>>>> yet has the rights to change it, may introduce future data problems
>>>>> they didn't expect.   I like that the default is the more restrictive
>>>>> method, because it makes people be explicit about data, yet the
>>>>> documentation should also aim to be explicit about something like a
>>>>> system-wide change.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> *Back to the story*
>>>>> 
>>>>> Ok, so now I will do an ALTER SESSION SET on the
>>>>> read_numbers_as_double setting.
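For reference, the session-scoped form of the same setting (option name as quoted earlier in this story):

```sql
-- Applies only to the current connection, not cluster-wide:
ALTER SESSION SET `store.json.read_numbers_as_double` = true;
```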
>>>>> 
>>>>> 
>>>>> 
>>>>> I run the query again.
>>>>> 
>>>>> 
>>>>> 
>>>>>> select * from `path_to/dump.json` limit 1;
>>>>> 
>>>>> Error: DATA_READ ERROR: Error parsing JSON - You tried to write a
>> VarChar
>>>>> type when you are using a ValueWriter of type SingleMapWriter.
>>>>> 
>>>>> 
>>>>> 
>>>>> File  /data/dev/path_to/dump.json
>>>>> 
>>>>> Record  4009
>>>>> 
>>>>> Fragment 0:0
>>>>> 
>>>>> 
>>>>> 
>>>>> Another error.   But what does this one mean? Ok, now that I have
>>>>> been living in the docs and on the Drill user list, and because it's
>>>>> similar to the schema change issue, I realize that is what we are
>>>>> looking at here.  Instead of double vs. int, we have one field that is
>>>>> a map most of the time, and in some cases it's a string.
>>>>> 
>>>>> 
>>>>> 
>>>>> But this doesn't really help me as a user.  To troubleshoot this,
>>>>> Drill doesn't offer any options. This file is 500 MB of dense and
>>>>> nested JSON data with 51k records.   My solution? I took the record
>>>>> number, then went to my NFS-mounted clustered file system (thank
>>>>> goodness I had MapR here; I am not sure how I would have done this
>>>>> with POSIX tools)
>>>>> 
>>>>> 
>>>>> 
>>>>> My command: $ head -4009 dump.json|tail -1
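As an aside, the same one-record extraction can be sketched with other POSIX tools, assuming the file is reachable on a POSIX filesystem (as via the NFS mount mentioned); the file here is a fabricated stand-in for the real dump:

```shell
# Fabricated mini dump standing in for dump.json: three one-line JSON records.
printf '{"rec":1}\n{"rec":2}\n{"rec":3}\n' > /tmp/dump_demo.json

# Print only record (line) 2 -- the sed equivalent of `head -2 file | tail -1`.
sed -n '2p' /tmp/dump_demo.json
```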
>>>>> 
>>>>> 
>>>>> 
>>>>> That (I hoped) showed me the record in question.  Note the error from
>>>>> Drill didn't tell me which field was at fault here, so I had to
>>>>> visually align things to work that out.  However, I was able to spot
>>>>> the difference and work with the dev to understand why it happened. I
>>>>> removed those records, and things worked correctly.
>>>>> 
>>>>> 
>>>>> 
>>>>> Could there have been a way to identify that within Drill? My
>>>>> solution was to write a Python script that read through the file and
>>>>> discarded the records that were not a map. That can work on 500 MB,
>>>>> but what about 500 GB?  I guess a Spark job could clean the data…. But
>>>>> could Drill be given some tools to help with this situation?
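A rough sketch of the Python pass described above; the field name "payload" and the record shapes are invented for illustration (the real field was whatever the SingleMapWriter error pointed at):

```python
import json

def keep_map_records(lines, field="payload"):
    """Keep only records whose `field` is absent or a map/object, so the
    surviving file presents one consistent type for Drill to infer."""
    kept = []
    for line in lines:
        rec = json.loads(line)
        if field not in rec or isinstance(rec[field], dict):
            kept.append(rec)
    return kept

lines = [
    '{"id": 1, "payload": {"a": 1}}',
    '{"id": 2, "payload": "a stray string"}',
    '{"id": 3}',
]
print([r["id"] for r in keep_map_records(lines)])  # → [1, 3]
```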
>>>>> 
>>>>> 
>>>>> 
>>>>> For example, the first thing I said was: What field is at issue?  I
>> had
>>>> no
>>>>> way to see what was up there.  I had to use other tools to see the
>> data
>>>> so
>>>>> I could understand the problem. Then when I understood the problem,
>> I had
>>>>> to use Python to produce data that was queryable.
>>>>> 
>>>>> 
>>>>> 
>>>>> Based on the design document Hsuan Yi Chu just posted to the mailing
>>>> list,
>>>>> at this point my post is just a user story to support the design
>>>> document.
>>>>> To summarize the points I’d like to see included in the design
>> document
>>>>> (from a user perspective), not understanding “how or why”:
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> *1.     **Error messages that are more verbose in explaining the
>> problem*
>>>>> 
>>>>> a.     Filename, row number, column number or name
>>>>> 
>>>>> b.     Option to output the “offending row”
>>>>> 
>>>>> c.     Showing the data that is causing the error WITH the type Drill
>>>>> inferred.
>>>>> 
>>>>> d.     If there are options to help work through dirty data, perhaps
>>>>> the error message could include those: “Data was a double, then Drill
>>>>> found this data: 0, which was an int, in file X at row 24 in column
>>>>> “myfloatingdata”; consider using store.json.read_numbers_as_double to
>>>>> address the issue.”
>>>>> 
>>>>> 2.     *A way to determine how common this exception is*
>>>>> 
>>>>> a.     If I am playing with a messy data set, and this error happens,
>>>> does
>>>>> it happen on 1 record? 2? 5000?  Knowing that information would:
>>>>> 
>>>>>                                               i.     Help users
>>>> understand
>>>>> how Drill is seeing that particular column
>>>>> 
>>>>>                                             ii.     Make decisions
>> on
>>>>> excluding data rather than just removing it. What if the first 10
>> records
>>>>> were errors, and then you excluded the remaining 10 million because
>> they
>>>>> were correct yet different from the first 10?
>>>>> 
>>>>> b.     Perhaps there could be a “stats” function that only works if
>> it’s
>>>>> the only selected item or if the select is all those functions (stats
>>>>> functions)?
>>>>> 
>>>>>                                               i.     Select
>>>>> type_stats(fieldsname) from data
>>>>> 
>>>>>                                             ii.      (that wouldn’t
>>>> error
>>>>> on different types)
>>>>> 
>>>>> 3.     *An ability to set a “return null on this field if error or if
>>>>> non-castable to X type” option, especially in a view, perhaps in a
>>>>> function.*
>>>>> 
>>>>> a.     Allow them to not have to reparse data outside drill
>>>>> 
>>>>> b.     Load it into a sane format (one time loads/ETL to clean data)
>>>>> 
>>>>> c.     Not be system or session wide exception.
>>>>> 
>>>>>                                               i.     I think this is
>>>>> important because I may have a field where I want it to read the
>> numbers
>>>> as
>>>>> double, but what if I have another field in the same dataset where I
>>>> don’t
>>>>> want it to read the numbers as double? A SYSTEM or SESSION level
>> variable
>>>>> takes away that granularity
>>>>> 
>>>>> d.     Select field1, CASTORNULL(field2, int) as field2,
>>>> CASTORNULL(field3,
>>>>> double) as field3 from ugly_data.
>>>>> 
>>>>> e.     That’s an example when it’s in the select, but I Could see a
>> where
>>>>> clause
>>>>> 
>>>>> f.      Select field1, field2, field3 from ugly_data where
>>>>> ISTYPE(field2, int) and ISTYPE(field3, double)
>>>>> 
>>>>> 4.     *Updating of the documentation related to ALTER SESSION vs
>> ALTER
>>>>> SYSTEM with an eye to the context of the majority use case of the
>>>>> documented feature*
>>>>> 
>>>>> a.     For data loads, the documentation uses ALTER SYSTEM and that’s
>>>>> problematic because:
>>>>> 
>>>>>                                               i.     Not all users
>> have
>>>>> the privileges to issue an ALTER SYSTEM. Thus a new user trying to
>> figure
>>>>> things out may not realize they can just ALTER SESSION after getting
>> an
>>>>> ALTER SYSTEM error.
>>>>> 
>>>>>                                             ii.     ALTER SYSTEM on
>>>>> data loading items, especially in areas that make Drill's data
>>>>> interpretation more permissive, can lead to unintended consequences
>>>>> later. An admin, who may be a good systems admin, and who helps a data
>>>>> user troubleshoot an error, may issue an ALTER SYSTEM not realizing
>>>>> this changes all future data imports.
>>>>> 
>>>>> b.     Note, I found a few cases, but I would suggest a thorough
>> review
>>>> of
>>>>> the various use cases throughout the documentation, and in areas
>> where it
>>>>> really could be either, have a small paragraph indicating the
>>>> ramifications
>>>>> of either command.
>>>>> 
>>>>> *5.     **A philosophy within the Drill community to “stay in Drill”
>>>>> for data exploration*
>>>>> 
>>>>> a.     This is obviously not so much a development thing as a mindset.
>>>>> If someone says “I tried to do X, and I got an error” and the
>>>>> community's response is “look through your data and do Z to it so
>>>>> Drill can read it”, then we should reconsider that scenario and try to
>>>>> provide an option within Drill to intuitively handle the edge case.
>>>>> This is difficult.
>>>>> 
>>>>> b.     There are cases even in the documentation where this happens:
>>>>> https://drill.apache.org/docs/json-data-model/ talks about arrays at
>>>>> the root level and reading some empty arrays.  In these cases, we have
>>>>> to leave Drill to fix the problem. That works on small data, but may
>>>>> not work on large or wide data. Consider the array-at-root-level
>>>>> limitation.  What if some process out of the user's control produces
>>>>> 1000 100 MB JSON files and we want to read them? To fix it, we have to
>>>>> address those files. Lots of work there, either manual or automated.
>>>>> 
>>>>> c.     Once again, I know this isn't easy, but if at all possible we
>>>>> shouldn't answer questions about how to do something by saying “fix
>>>>> this outside of Drill so Drill can read your data”.
>>>>> 
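To make points 2 and 3 above concrete, here is a rough Python sketch of the semantics being asked for; the names type_stats, cast_or_null (CASTORNULL), and is_type (ISTYPE) come from the proposal, while the field names and data are invented:

```python
import json
from collections import Counter

def type_stats(records, field):
    """Sketch of the proposed type_stats(): count how often each inferred
    type shows up in a column, instead of erroring on the first mismatch."""
    return Counter(type(rec.get(field)).__name__ for rec in records)

def cast_or_null(value, target):
    """Sketch of the proposed CASTORNULL: return the value cast to the
    target type, or None when the cast fails, instead of failing the query."""
    try:
        return target(value)
    except (TypeError, ValueError):
        return None

def is_type(value, target):
    """Sketch of the proposed ISTYPE predicate."""
    return cast_or_null(value, target) is not None

records = [json.loads(s) for s in (
    '{"f": {"a": 1}}',
    '{"f": "oops, a string"}',
    '{"f": {"b": 2}}',
)]
print(type_stats(records, "f"))   # how mixed is the column?
print(cast_or_null("12", float))  # → 12.0
print(is_type("oops", float))     # → False
```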
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> I hope this story helps support the design document presented.  I am
>>>>> happy to participate in more discussion around these topics, as I have
>>>>> enjoyed digging into the internals of Drill.
>>>>> 
>>>>> 
>>>>> 
>>>>> John Omernik
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> *Name: Ganesh Semalty*
>>>> *Location: Gurgaon, Haryana (India)*
>>>> *Email Id: g4ganeshsemalty@gmail.com*
>>>> 
>>>> *Please consider the environment before printing this e-mail - SAVE
>>>> TREE.*
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 


Re: New Topic: Drill Visualization and Use Support

Posted by ganesh <g4...@gmail.com>.
Hello,

Are there any other tutorial links for Tableau vs. Hive?

I have already gone through the one on the Apache Drill site, but I am not
able to proceed with it.

On Thu, Oct 22, 2015 at 8:03 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> Hive should be visible and usable in Tableau. You can use Drill views for
> dfs data, or you can use Tableau Custom SQL to access the data.
>
> Make sure to install the Tableau TDC file that comes with the ODBC driver.
> https://drill.apache.org/docs/installing-the-tdc-file-on-windows/
> https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/
>
>
> —Andries
>
>
> > On Oct 22, 2015, at 6:40 AM, ganesh <g4...@gmail.com> wrote:
> >
> > Hi John,
> >
> > Thanks for suggesting the new name: Apache Zeppelin.
> >
> > I am currently trying the 14-day trial version of Tableau, with not much
> > success. Only today did I learn that for files in Hadoop or the local
> > file system, I would need to create a view.
> >
> > Still, though I can see my tables from Hive in Tableau, I cannot see any
> > data. Nor am I able to use Tableau following the help currently linked
> > from the Apache Drill docs
> > (http://drill.apache.org/docs/tableau-examples/).
> >
> > Snapshot attached, in case you have worked with Tableau.
> >
> > I will look into Zeppelin also.
> >
> >
> > On Thu, Oct 22, 2015 at 6:43 PM, John Omernik <john@omernik.com <mailto:
> john@omernik.com>> wrote:
> > I separated my response from the original topic to keep any responses
> there
> > focused on the design document.
> >
> > As to ways to use Drill, I have been working with SQL Squirrel quite
> > successfully.  On my list of things to do, if you want to stay in the
> > Apache world is looking at Apache Zeppelin.  In the Git Repo, there is a
> > Drill Plugin so you can run SQL again Drill, look at results, and do
> basic
> > visualizations, I have been trying to way until the PR is merged on
> > Zeppelin, but for your use case, you may want to grab the plugin code and
> > run with it.
> >
> >
> >
> >
> > On Thu, Oct 22, 2015 at 6:06 AM, ganesh <g4ganeshsemalty@gmail.com
> <ma...@gmail.com>> wrote:
> >
> > > Hello,
> > >
> > > John, you seem to be quite impressed with apache drill .. nice.
> > > I am new to un-structured world and just started 1 week back on APACHE
> > > DRILL after suggestion from my collegues. We have a semi structured
> data
> > > where we have constraint that we do not know number of columns
> > >
> > > I heard that APACHE DRILL is column free applicationa nd with support
> of
> > > JSON format, it allows to create columns on-fly,
> > > I converted my data from CSV-like-format to JSON and trying to figure
> out
> > > if it will work for me.
> > >
> > > Here I hit two issues :-
> > > 1) My column were like : 3100.2.1.2 <tel:3100.2.1.2>  and values like
> '-2303" or
> > > '01/01/2015
> > > 02:02:00"
> > >
> > > Challenge was that column cant be started with Numeric value. So I had
> to
> > > change key as: "t3100.2.1.2"
> > > After that things were quite OK,
> > >
> > > Now I need some help from you guys. To proceed I have to present my
> work to
> > > management as an example.
> > > But querying on apache drill console, doesnt seem to be an attractive
> way
> > > to present things.
> > >
> > > I tried drill explorer too.But didnt find that so good.
> > > One thing to note, I am playing with files on Hadoop standalone mode in
> > > UBUNTU.
> > >
> > > To make it appear more good looking, I started with QLIK SENSE .. but
> was
> > > unable to connect it with hadoop file system. It only showed me HIVE
> FILES.
> > > Then I downloaded TABLEAU Trial version ... but I am unable to get
> Hadoop
> > > data here too...
> > >
> > > Please help me how to proceed. I have presentation on coming Monday.
> > > Queries are quite ready .. I just need to show in visualization form
> > > ........ using OPEN SOURCE applications only.
> > >
> > >
> > > Guys please help me.
> > >
> > >
> > >
> > > On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <john@omernik.com
> <ma...@omernik.com>> wrote:
> > >
> > > > AWESOME!
> > > >
> > > > I had just been in the process of writing up a long user story to
> ask for
> > > > and support exactly this.   I modified it and included it here:
> > > >
> > > >
> > > > To start out, I want to say how much I love the Drill project, and
> the
> > > > potential it has. I've put this together based on my experiences and
> want
> > > > to contribute a perspective as a user, not just put a bunch of
> critiques
> > > in
> > > > an email.  I hope it's all taken in that spirit.  Additional note, I
> > > wrote
> > > > this prior to seeing the Design Document share by Hsuan Yi Chu
> yesterday.
> > > > If you are reading it, and think to yourself “that wording is odd…”
> > > please
> > > > consider it from the “I didn’t want to throw away the user story”
> > > > perspective and the “I wrote it before the design doc” perspective.
> > > >
> > > >
> > > >
> > > > Additionally, I understand that some of what I am suggesting may not
> be
> > > > easy from a development perspective.  I am just being upfront with my
> > > > experience, so we can look to determine what can be done; I am not
> > > looking
> > > > for a silver bullet here, just looking for improvement.  Some may be
> as
> > > > simple as better documentation, other suggestions may be harder to
> > > > implement.  Either way, I thought a verbose user story might be
> useful to
> > > > the community as a whole.
> > > >
> > > >
> > > >
> > > > John
> > > >
> > > >
> > > >
> > > > *User Story*
> > > >
> > > >
> > > >
> > > > As I have been working with Drill for data exploration, I came across
> > > > multiple "things" that just were hard.  In dealing with some data,
> > > > especially JSON data, it can be ugly, and scaled ugly is even worse!
> > > >
> > > >
> > > >
> > > > For this story, I am working with a JSON dump from MongoDB, and you
> would
> > > > think it would be well structured, and for the most part it is.
> There
> > > are
> > > > some application level mistakes that were made (I will go into that
> in a
> > > > moment), but in general Drill handles this well.  So with this data
> set,
> > > > there are a few main challenges I am seeing:
> > > >
> > > >
> > > >
> > > > 1.     When there is a field that has a float, and then a later
> record
> > > has
> > > > the number 0 in it (which Drill takes as a INT). This is a known
> problem
> > > > and one that Drill has a solution for.
> > > >
> > > > 2.     When there is a field is of one type (a map) and then a later
> > > record
> > > > has a string in it.  No easy solution here.
> > > >
> > > > 3.     Select * where there is a json field with a . in the name. I
> won’t
> > > > go into details here, but I feel this factors into data exploration,
> > > > because it changes the ability to “stay in Drill” to explore their
> data (
> > > > https://issues.apache.org/jira/browse/DRILL-3922 <
> https://issues.apache.org/jira/browse/DRILL-3922>)
> > > >
> > > > 4.     Error reporting challenges
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > With the problem summary laid out, I wanted to walk through my
> process in
> > > > working with this data, and where, if I were a user Drill could have
> been
> > > > much more helpful to the process.
> > > >
> > > >
> > > >
> > > > Here is a description of the process I went through:
> > > >
> > > >
> > > >
> > > > 1.     Copy data into filesystem
> > > >
> > > > 2.     Use drill to “Select * from `path_to/dump.json` limit 1
> > > >
> > > > 3.     (I just want to see what it looks like!)
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Here I get this error:
> > > >
> > > >
> > > >
> > > > > select * from `path_to/ dump.json` limit 1;
> > > >
> > > > Error: DATA_READ ERROR: You tried to write a BigInt type when you are
> > > using
> > > > a ValueWriter of type NullableFloat8WriterImpl.
> > > >
> > > >
> > > >
> > > > File  /data/dev/path_to/dump.json
> > > >
> > > > Record  1
> > > >
> > > > Line  1
> > > >
> > > > Column  9054
> > > >
> > > > Field  entropy
> > > >
> > > > Fragment 0:0
> > > >
> > > >
> > > >
> > > > This isn’t incredibly helpful from a user perspective.  I.e. When I
> > > Google
> > > > around, I realize now that in the docs it talks about “Schema
> Changes”
> > > and
> > > > one possible item is use the setting below. However, examples of the
> data
> > > > that was trying to be displayed (with it’s implied type) may help
> users
> > > > grok what is happening.  At least in this case it showed me the field
> > > name!
> > > >
> > > >
> > > >
> > > > ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
> > > >
> > > >
> > > >
> > > > This is a great example where since we have known use case (when
> numbers
> > > > are doubles but someone tries to store 0 an INT) it fails, thus dev’s
> > > have
> > > > added a setting to allow a user to get through that, that the error
> > > message
> > > > could be more helpful.   In this case, Showing two record numbers
> (line
> > > > numbers) with different types, the field values with their implied
> types,
> > > > and perhaps a suggestion about using the setting to address the
> problem.
> > > > This could make it more intuitive for the user to stay in Drill, and
> stay
> > > > in the data.   In this case, I looked at a head of the file, and saw
> the
> > > > issue and was able to proceed.
> > > >
> > > >
> > > >
> > > > Also, as a corollary here, the user documentation does not show this
> > > error
> > > > related to the schema change problem. This would be a great place to
> > > state,
> > > > “if you see an error that looks like X, this is what is happening and
> > > what
> > > > you can do for it.”
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > *Side node on documentation*
> > > >
> > > > We should look to have documentation try to be role based.   In this
> > > case,
> > > > the documentation says use “ALTER SYSTEM” I would argue, and I am
> > > guessing
> > > > others would concur, that for this use case, “ALTER SESSION” may be a
> > > > better suggestion as this is specific alteration to address the use
> case
> > > of
> > > > loading/querying a specific data set, and is likely done by a user
> of the
> > > > system.
> > > >
> > > >
> > > >
> > > > If a user is doing self-serve data, then in an enterprise
> environment,
> > > they
> > > > may not have the ability to use ALTER SYSTEM and get an error, thus
> may
> > > be
> > > > confused on how to proceed.   In addition ALTER SYSTEM by a user who
> > > > doesn’t understand that they are changing, yet have the rights to
> change,
> > > > may introduce future data problems they didn’t expect.   I like that
> the
> > > > default is a more constrictive method, because it makes people be
> > > explicit
> > > > about data, yet the documentation should also aim to be explicit
> about
> > > > something like a system wide change.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > *Back to the story*
> > > >
> > > > Ok so now I will do ALTER SESSION SET on the read_numbers_as_double
> > > setting
> > > >
> > > >
> > > >
> > > > I run the query again.
> > > >
> > > >
> > > >
> > > > > select * from `path_to/dump.json` limit 1;
> > > >
> > > > Error: DATA_READ ERROR: Error parsing JSON - You tried to write a
> VarChar
> > > > type when you are using a ValueWriter of type SingleMapWriter.
> > > >
> > > >
> > > >
> > > > File  /data/dev/path_to/dump.json
> > > >
> > > > Record  4009
> > > >
> > > > Fragment 0:0
> > > >
> > > >
> > > >
> > > > Another error   But what does this one mean? Ok, now that I have been
> > > > living in the docs and in the Drill user list, and because it’s
> similar
> > > to
> > > > the schema change issue, that that is what we are looking at here.
> > > Instead
> > > > of double to int, we have one field that is map most of the time,
> and in
> > > > some cases it’s a string.
> > > >
> > > >
> > > >
> > > > But this doesn’t really help me as a user.  To troubleshoot this
> Drill
> > > > doesn’t offer any options. This file is 500 MB of dense and nested
> JSON
> > > > data with 51k records.   My solution? I took the record number, then
> I
> > > went
> > > > to my NFS mounted clustered file system (thank goodness I had MapR
> here,
> > > I
> > > > am not sure how I would have done this with Posix tools)
> > > >
> > > >
> > > >
> > > > My command: $ head -4009 dump.json|tail -1
> > > >
> > > >
> > > >
> > > > That (I hoped) showed me the record in question, note the error from
> > > Drill
> > > > didn’t tell me which field was at fault here, so I had to visually
> align
> > > > things to address that.  However, I was able to spot the difference
> and
> > > > work with the dev to understand why that happened. I removed those
> > > records,
> > > > and things worked correctly.
> > > >
> > > >
> > > >
> > > > Could there have been a way to identify that within drill? My
> solution
> > > was
> > > > to take a python script and read through, and discard those records
> that
> > > > were not a map, however, on 500MB that can work, but what about 500
> GB?
> > > I
> > > > guess a Spark job could clean the data…. But could Drill be given
> some
> > > > tools to help with this situation?
> > > >
> > > >
> > > >
> > > > For example, the first thing I said was: What field is at issue?  I
> had
> > > no
> > > > way to see what was up there.  I had to use other tools to see the
> data
> > > so
> > > > I could understand the problem. Then when I understood the problem,
> I had
> > > > to use Python to produce data that was queryable.
> > > >
> > > >
> > > >
> > > > Based on the design document Hsuan Yi Chu just posted to the mailing
> > > list,
> > > > at this point my post is just a user story to support the design
> > > document.
> > > > To summarize the points I’d like to see included in the design
> document
> > > > (from a user perspective), not understanding “how or why”:
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > *1.     **Error messages that are more verbose in explaining the
> problem*
> > > >
> > > > a.     Filename, row number, column number or name
> > > >
> > > > b.     Option to output the “offending row”
> > > >
> > > > c.     Showing the data that is causing the error WITH the type Drill
> > > > inferred.
> > > >
> > > > d.     If there are options to help work through dirty data, perhaps
> the
> > > > error message could include those: “Data was an double, then drill
> found
> > > > this data: 0 that was a int in File x, at row 24 in column
> > > “myfloatingdata”
> > > > consider using store.json.read_numbers_as_double to address the
> issue.
> > > >
> > > > 2.     *A way to determine how common this exception is*
> > > >
> > > > a.     If I am playing with a messy data set, and this error happens,
> > > does
> > > > it happen on 1 record? 2? 5000?  Knowing that information would:
> > > >
> > > >                                                i.     Help users
> > > understand
> > > > how Drill is seeing that particular column
> > > >
> > > >                                              ii.     Make decisions
> on
> > > > excluding data rather than just removing it. What if the first 10
> records
> > > > were errors, and then you excluded the remaining 10 million because
> they
> > > > were correct yet different from the first 10?
> > > >
> > > > b.     Perhaps there could be a “stats” function that only works if
> it’s
> > > > the only selected item or if the select is all those functions (stats
> > > > functions)?
> > > >
> > > >                                                i.     Select
> > > > type_stats(fieldsname) from data
> > > >
> > > >                                              ii.      (that wouldn’t
> > > error
> > > > on different types)
> > > >
> > > > 3.     *An ability to set a “return null on this field if error or
> if non
> > > > castable to X type, especially in a view, perhaps in a function.*
> > > >
> > > > a.     Allow them to not have to reparse data outside drill
> > > >
> > > > b.     Load it into a sane format (one time loads/ETL to clean data)
> > > >
> > > > c.     Not be a system- or session-wide exception.
> > > >
> > > >                                                i.     I think this is
> > > > important because I may have a field where I want numbers read as
> > > > double, but what if another field in the same dataset should not have
> > > > its numbers read as double? A SYSTEM- or SESSION-level variable takes
> > > > away that granularity.
> > > >
> > > > d.     Select field1, CASTORNULL(field2, int) as field2,
> > > > CASTORNULL(field3, double) as field3 from ugly_data
> > > >
> > > > e.     That’s an example of it in the select, but I could see a where
> > > > clause too:
> > > >
> > > > f.      Select field1, field2, field3 from ugly_data where
> > > > ISTYPE(field2, int) and ISTYPE(field3, double)
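CASTORNULL and ISTYPE are proposals, not existing Drill SQL functions; the intended per-field semantics can be sketched in Python as:

```python
def cast_or_null(value, target):
    # Proposed CASTORNULL(field, type): return the cast value,
    # or None (SQL NULL) when the value cannot be cast.
    try:
        return target(value)
    except (TypeError, ValueError):
        return None

def is_type(value, target):
    # Proposed ISTYPE(field, type): True when the cast would succeed.
    return cast_or_null(value, target) is not None

row = {"field1": "a", "field2": "42", "field3": "oops"}
print(cast_or_null(row["field2"], int))    # 42
print(cast_or_null(row["field3"], float))  # None
print(is_type(row["field2"], int))         # True
```

The point of the proposal is that the tolerance is scoped to one expression in one query, rather than a SYSTEM- or SESSION-wide switch.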
> > > >
> > > > 4.     *Updating of the documentation related to ALTER SESSION vs
> ALTER
> > > > SYSTEM with an eye to the context of the majority use case of the
> > > > documented feature*
> > > >
> > > > a.     For data loads, the documentation uses ALTER SYSTEM, and that’s
> > > > problematic because:
> > > >
> > > >                                                i.     Not all users
> > > > have the privileges to issue an ALTER SYSTEM. Thus a new user trying to
> > > > figure things out may not realize they can just ALTER SESSION after
> > > > getting an ALTER SYSTEM error.
> > > >
> > > >                                              ii.     ALTER SYSTEM on
> > > > data-loading items, especially ones that make Drill’s data
> > > > interpretation more permissive, can lead to unintended consequences
> > > > later. An admin, who may be a good systems admin and is helping a data
> > > > user troubleshoot an error, may issue an ALTER SYSTEM without realizing
> > > > it changes all future data imports.
> > > >
> > > > b.     Note, I found a few cases, but I would suggest a thorough
> > > > review of the various use cases throughout the documentation; in areas
> > > > where it really could be either, have a small paragraph indicating the
> > > > ramifications of either command.
> > > >
> > > > *5.     **A Philosophy within the Drill Community to “Stay in Drill”
> > > > for data exploration*
> > > >
> > > > a.     This is obviously not as much a development thing as a mindset.
> > > > If someone says “I tried to do X and I got an error” and the
> > > > community’s response is “look through your data and do Z to it so
> > > > Drill can read it,” then we should reconsider that scenario and try to
> > > > provide an option within Drill to handle the edge case intuitively.
> > > > This is difficult.
> > > >
> > > > b.     There are cases even in the documentation where this is the
> > > > case: https://drill.apache.org/docs/json-data-model/ talks about
> > > > arrays at the root level and reading some empty arrays.  In these
> > > > cases, we have to leave Drill to fix the problem. That works on small
> > > > data, but may not work on large or wide data. Consider the
> > > > array-at-root-level limitation: what if some process outside the
> > > > user’s control produces 1,000 100 MB JSON files that we want to read?
> > > > To fix it, we have to address those files. That is a lot of work,
> > > > either manual or automated.
> > > >
> > > > c.     Once again, I know this isn’t easy, but whenever possible we
> > > > shouldn’t answer questions about how to do something by saying “fix
> > > > this outside of Drill so Drill can read your data.”
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > I hope this story helps support the design document presented.  I am
> > > > happy to participate in more discussion around these topics, as I have
> > > > enjoyed digging into the internals of Drill.
> > > >
> > > >
> > > >
> > > > John Omernik
> > > >
> > >
> > >
> > >
> > > --
> > > *Name: Ganesh Semalty*
> > > *Location: Gurgaon,Haryana(India)*
> > > *Email Id: g4ganeshsemalty@gmail.com <ma...@gmail.com>
> <g4ganeshsemalty@gmail.com <ma...@gmail.com>>*
> > >
> > >
> > > P
> > >
> > > *Please consider the environment before printing this e-mail - SAVE
> TREE.*
> > >
> >
> >
> >
> > --
> > Name: Ganesh Semalty
> > Location: Gurgaon,Haryana(India)
> > Email Id: g4ganeshsemalty@gmail.com <ma...@gmail.com>
> >
> > P
> > Please consider the environment before printing this e-mail - SAVE TREE.
>
>


-- 
*Name: Ganesh Semalty*
*Location: Gurgaon,Haryana(India)*
*Email Id: g4ganeshsemalty@gmail.com*

*Please consider the environment before printing this e-mail - SAVE TREE.*

Re: New Topic: Drill Visualization and Use Support

Posted by Andries Engelbrecht <ae...@maprtech.com>.
Hive should be visible and usable in Tableau. You can use Drill views for dfs data, or you can use Tableau Custom SQL to access the data.

Make sure to install the Tableau TDC file that comes with the ODBC driver.
https://drill.apache.org/docs/installing-the-tdc-file-on-windows/
https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/


—Andries


> On Oct 22, 2015, at 6:40 AM, ganesh <g4...@gmail.com> wrote:
> 
> Hi John,
> 
> Thanks for suggesting new name:  Apache Zeppelin
> 
> I was trying the 14-day trial version of Tableau, with not much success.
> Only today did I learn that for files in Hadoop or the local file system, I would need to create a view.
> 
> Still, though I can see my tables from Hive in Tableau, I cannot see any data.
> Nor am I able to use Tableau from the help links currently in the Drill docs (http://drill.apache.org/docs/tableau-examples/)
> 
> Snapshot attached, in case you have worked with Tableau
> 
> I will look into Zeppelin also.
> 
> 
> On Thu, Oct 22, 2015 at 6:43 PM, John Omernik <john@omernik.com <ma...@omernik.com>> wrote:
> I separated my response from the original topic to keep any responses there
> focused on the design document.
> 
> As for ways to use Drill, I have been working with SQuirreL SQL quite
> successfully.  Also on my list of things to do, if you want to stay in the
> Apache world, is looking at Apache Zeppelin.  In its Git repo, there is a
> Drill plugin so you can run SQL against Drill, look at results, and do basic
> visualizations. I have been waiting until the PR is merged into
> Zeppelin, but for your use case, you may want to grab the plugin code and
> run with it.
> 
> 
> 
> 
> On Thu, Oct 22, 2015 at 6:06 AM, ganesh <g4ganeshsemalty@gmail.com <ma...@gmail.com>> wrote:
> 
> > Hello,
> >
> > John, you seem to be quite impressed with Apache Drill .. nice.
> > I am new to the unstructured world and just started one week back on
> > Apache Drill after a suggestion from my colleagues. We have
> > semi-structured data with the constraint that we do not know the number
> > of columns.
> >
> > I heard that Apache Drill is a column-free application and, with support
> > for the JSON format, it allows creating columns on the fly.
> > I converted my data from a CSV-like format to JSON and am trying to
> > figure out if it will work for me.
> >
> > Here I hit two issues:
> > 1) My columns were like: 3100.2.1.2, with values like "-2303" or
> > "01/01/2015 02:02:00"
> >
> > The challenge was that a column name can't start with a numeric value,
> > so I had to change the key to: "t3100.2.1.2"
> > After that things were quite OK.
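The manual key rename described here can be scripted; a hedged sketch (the `t` prefix mirrors Ganesh's workaround, and newline-delimited JSON input is assumed):

```python
import json

def fix_keys(obj, prefix="t"):
    # Recursively prefix any key that starts with a digit, so names
    # like "3100.2.1.2" become "t3100.2.1.2" and are queryable.
    if isinstance(obj, dict):
        return {(prefix + k if k[:1].isdigit() else k): fix_keys(v, prefix)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [fix_keys(v, prefix) for v in obj]
    return obj

line = '{"3100.2.1.2": "-2303", "nested": {"2.1": "01/01/2015 02:02:00"}}'
print(json.dumps(fix_keys(json.loads(line))))
```

Run over each line of the dump, this rewrites every offending key (including nested ones) instead of editing them by hand.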
> >
> > Now I need some help from you guys. To proceed, I have to present my
> > work to management as an example.
> > But querying on the Apache Drill console doesn't seem to be an
> > attractive way to present things.
> >
> > I tried Drill Explorer too, but didn't find it that good.
> > One thing to note: I am playing with files on Hadoop in standalone mode
> > on Ubuntu.
> >
> > To make it look better, I started with Qlik Sense .. but was unable to
> > connect it with the Hadoop file system. It only showed me Hive files.
> > Then I downloaded the Tableau trial version ... but I am unable to get
> > Hadoop data here either...
> >
> > Please help me figure out how to proceed. I have a presentation this
> > coming Monday.
> > The queries are quite ready .. I just need to show them in visualization
> > form ........ using OPEN SOURCE applications only.
> >
> >
> > Guys please help me.
> >
> >
> >
> > On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <john@omernik.com <ma...@omernik.com>> wrote:
> >
> > > AWESOME!
> > >
> > > I had just been in the process of writing up a long user story to ask for
> > > and support exactly this.   I modified it and included it here:
> > >
> > >
> > > To start out, I want to say how much I love the Drill project and the
> > > potential it has. I've put this together based on my experiences and want
> > > to contribute a perspective as a user, not just put a bunch of critiques
> > > in an email.  I hope it's all taken in that spirit.  Additional note: I
> > > wrote this prior to seeing the design document shared by Hsuan Yi Chu
> > > yesterday. If you are reading it and think to yourself “that wording is
> > > odd…” please consider it from the “I didn’t want to throw away the user
> > > story” perspective and the “I wrote it before the design doc” perspective.
> > >
> > >
> > >
> > > Additionally, I understand that some of what I am suggesting may not be
> > > easy from a development perspective.  I am just being upfront with my
> > > experience so we can look to determine what can be done; I am not looking
> > > for a silver bullet here, just improvement.  Some suggestions may be as
> > > simple as better documentation; others may be harder to implement.
> > > Either way, I thought a verbose user story might be useful to the
> > > community as a whole.
> > >
> > >
> > >
> > > John
> > >
> > >
> > >
> > > *User Story*
> > >
> > >
> > >
> > > As I have been working with Drill for data exploration, I came across
> > > multiple "things" that were just hard.  In dealing with some data,
> > > especially JSON data, it can be ugly, and scaled ugly is even worse!
> > >
> > >
> > >
> > > For this story, I am working with a JSON dump from MongoDB, and you would
> > > think it would be well structured; for the most part it is.  There are
> > > some application-level mistakes that were made (I will go into that in a
> > > moment), but in general Drill handles this well.  So with this data set,
> > > there are a few main challenges I am seeing:
> > >
> > >
> > >
> > > 1.     When a field has a float, and then a later record has the number
> > > 0 in it (which Drill takes as an INT). This is a known problem and one
> > > that Drill has a solution for.
> > >
> > > 2.     When a field is of one type (a map) and then a later record has a
> > > string in it.  No easy solution here.
> > >
> > > 3.     Select * where there is a JSON field with a . in the name. I won’t
> > > go into details here, but I feel this factors into data exploration,
> > > because it changes the user’s ability to “stay in Drill” to explore their
> > > data (https://issues.apache.org/jira/browse/DRILL-3922)
> > >
> > > 4.     Error reporting challenges
> > >
> > >
> > >
> > >
> > >
> > > With the problem summary laid out, I wanted to walk through my process in
> > > working with this data and point out where Drill could have been much
> > > more helpful to me as a user.
> > >
> > >
> > >
> > > Here is a description of the process I went through:
> > >
> > >
> > >
> > > 1.     Copy data into the filesystem
> > >
> > > 2.     Use Drill to “Select * from `path_to/dump.json` limit 1”
> > >
> > > 3.     (I just want to see what it looks like!)
> > >
> > >
> > >
> > >
> > >
> > > Here I get this error:
> > >
> > >
> > >
> > > > select * from `path_to/dump.json` limit 1;
> > >
> > > Error: DATA_READ ERROR: You tried to write a BigInt type when you are
> > using
> > > a ValueWriter of type NullableFloat8WriterImpl.
> > >
> > >
> > >
> > > File  /data/dev/path_to/dump.json
> > >
> > > Record  1
> > >
> > > Line  1
> > >
> > > Column  9054
> > >
> > > Field  entropy
> > >
> > > Fragment 0:0
> > >
> > >
> > >
> > > This isn’t incredibly helpful from a user perspective.  When I Googled
> > > around, I realized that the docs talk about “Schema Changes” and that one
> > > possible remedy is the setting below. However, examples of the data Drill
> > > was trying to read (with its implied type) might help users grok what is
> > > happening.  At least in this case it showed me the field name!
> > >
> > >
> > >
> > > ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
> > >
> > >
> > >
> > > This is a great example of a known failure case (numbers are doubles but
> > > someone tries to store 0, an INT) for which the devs have added a setting
> > > to let a user get through it, yet the error message could be more
> > > helpful: it could show two record numbers (line numbers) with different
> > > types, the field values with their implied types, and perhaps a
> > > suggestion about using the setting to address the problem.  This would
> > > make it more intuitive for the user to stay in Drill, and stay in the
> > > data.  In this case, I looked at the head of the file, saw the issue, and
> > > was able to proceed.
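Outside Drill, the effect of store.json.read_numbers_as_double can be imitated when sanity-checking a file; a sketch using Python's json hooks (this illustrates the semantics, not Drill's actual implementation):

```python
import json

def load_numbers_as_double(line):
    # parse_int=float mimics store.json.read_numbers_as_double:
    # every JSON integer is widened to a double, so a stray 0 in an
    # otherwise-float column no longer changes the column's type.
    return json.loads(line, parse_int=float)

rec = load_numbers_as_double('{"entropy": 0}')
print(type(rec["entropy"]).__name__)  # float
```

Loading a suspect file this way and comparing types per field confirms whether the double/int conflict is the only schema change present.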
> > >
> > >
> > >
> > > Also, as a corollary here, the user documentation does not show this
> > > error in relation to the schema change problem. It would be a great place
> > > to state, “if you see an error that looks like X, this is what is
> > > happening and what you can do about it.”
> > >
> > >
> > >
> > >
> > >
> > > *Side note on documentation*
> > >
> > > We should look to have documentation be role based.   In this case, the
> > > documentation says use “ALTER SYSTEM”. I would argue, and I am guessing
> > > others would concur, that for this use case “ALTER SESSION” may be a
> > > better suggestion, as this is a specific alteration to address the use
> > > case of loading/querying a specific data set, and is likely done by a
> > > user of the system.
> > >
> > >
> > >
> > > If a user is doing self-serve data exploration in an enterprise
> > > environment, they may not have the ability to use ALTER SYSTEM and will
> > > get an error, and thus may be confused about how to proceed.   In
> > > addition, an ALTER SYSTEM by a user who doesn’t understand what they are
> > > changing, yet has the rights to change it, may introduce future data
> > > problems they didn’t expect.   I like that the default is the more
> > > restrictive method, because it makes people be explicit about data, yet
> > > the documentation should also aim to be explicit about something like a
> > > system-wide change.
> > >
> > >
> > >
> > >
> > >
> > > *Back to the story*
> > >
> > > OK, so now I will do ALTER SESSION SET on the read_numbers_as_double
> > > setting.
> > >
> > >
> > >
> > > I run the query again.
> > >
> > >
> > >
> > > > select * from `path_to/dump.json` limit 1;
> > >
> > > Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar
> > > type when you are using a ValueWriter of type SingleMapWriter.
> > >
> > >
> > >
> > > File  /data/dev/path_to/dump.json
> > >
> > > Record  4009
> > >
> > > Fragment 0:0
> > >
> > >
> > >
> > > Another error.  But what does this one mean? Now that I have been living
> > > in the docs and on the Drill user list, and because it’s similar to the
> > > schema change issue, I know that is what we are looking at here.  Instead
> > > of double vs. int, we have one field that is a map most of the time, and
> > > in some cases a string.
> > >
> > >
> > >
> > > But this doesn’t really help me as a user, and to troubleshoot it Drill
> > > doesn’t offer any options. This file is 500 MB of dense, nested JSON data
> > > with 51k records.   My solution? I took the record number, then went to
> > > my NFS-mounted clustered file system (thank goodness I had MapR here; I
> > > am not sure how I would have done this with POSIX tools)
> > >
> > >
> > >
> > > My command: $ head -4009 dump.json | tail -1
> > >
> > >
> > >
> > > That (I hoped) showed me the record in question; note that the error from
> > > Drill didn’t tell me which field was at fault, so I had to visually align
> > > things to figure that out.  However, I was able to spot the difference
> > > and work with the dev to understand why it happened. I removed those
> > > records, and things worked correctly.
> > >
> > >
> > >
> > > Could there have been a way to identify that within Drill? My solution
> > > was to write a Python script that read through the file and discarded the
> > > records that were not a map. On 500 MB that can work, but what about
> > > 500 GB?  I guess a Spark job could clean the data…. But could Drill be
> > > given some tools to help with this situation?
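The cleanup script described here can be sketched as a streaming filter; the field name and file paths below are illustrative, and newline-delimited JSON is assumed:

```python
import json

def keep_map_records(in_path, out_path, field):
    # Stream newline-delimited JSON and keep only the records where
    # `field` is a map (dict), discarding the mixed-type offenders.
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            rec = json.loads(line)
            if isinstance(rec.get(field), dict):
                dst.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Streaming line by line keeps memory use flat, so the same script works on 500 MB or 500 GB, which is the scaling concern raised above.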
> > >
> > >
> > >
> > > For example, the first thing I asked was: what field is at issue?  I had
> > > no way to see what was up there.  I had to use other tools to see the
> > > data so I could understand the problem, and then, once I understood it, I
> > > had to use Python to produce data that was queryable.
> > >
> > >
> > >
> > > Based on the design document Hsuan Yi Chu just posted to the mailing
> > > list, at this point my post is just a user story to support the design
> > > document; the points I’d like to see included (from a user perspective,
> > > not understanding “how or why”) are summarized in the numbered list
> > > earlier in this thread.
> > >
> > > I hope this story helps support the design document presented.  I am
> > > happy to participate in more discussion around these topics, as I have
> > > enjoyed digging into the internals of Drill.
> > >
> > > John Omernik


Re: New Topic: Drill Visualization and Use Support

Posted by ganesh <g4...@gmail.com>.
Hi John,

Thanks for suggesting new name:  Apache Zeppelin

I was trying the 14-day trial version of Tableau with not much
success.
Only today did I learn that for files in Hadoop or the local file system, I
would need to create a view.

Still, though I can see my tables from Hive in Tableau, I cannot see any
data.
Nor am I able to use Tableau from the help links currently in the Drill
docs (http://drill.apache.org/docs/tableau-examples/)

*Snapshot attached,* in case you have worked with Tableau

I will look into Zeppelin also.


On Thu, Oct 22, 2015 at 6:43 PM, John Omernik <jo...@omernik.com> wrote:

> I separated my response from the original topic to keep any responses there
> focused on the design document.
>
> As to ways to use Drill, I have been working with SQL Squirrel quite
> successfully.  On my list of things to do, if you want to stay in the
> Apache world is looking at Apache Zeppelin.  In the Git Repo, there is a
> Drill Plugin so you can run SQL again Drill, look at results, and do basic
> visualizations, I have been trying to way until the PR is merged on
> Zeppelin, but for your use case, you may want to grab the plugin code and
> run with it.
>
>
>
>
> On Thu, Oct 22, 2015 at 6:06 AM, ganesh <g4...@gmail.com> wrote:
>
> > Hello,
> >
> > John, you seem to be quite impressed with apache drill .. nice.
> > I am new to un-structured world and just started 1 week back on APACHE
> > DRILL after suggestion from my collegues. We have a semi structured data
> > where we have constraint that we do not know number of columns
> >
> > I heard that APACHE DRILL is column free applicationa nd with support of
> > JSON format, it allows to create columns on-fly,
> > I converted my data from CSV-like-format to JSON and trying to figure out
> > if it will work for me.
> >
> > Here I hit two issues :-
> > 1) My column were like : 3100.2.1.2  and values like '-2303" or
> > '01/01/2015
> > 02:02:00"
> >
> > Challenge was that column cant be started with Numeric value. So I had to
> > change key as: "t3100.2.1.2"
> > After that things were quite OK,
> >
> > Now I need some help from you guys. To proceed I have to present my work
> to
> > management as an example.
> > But querying on apache drill console, doesnt seem to be an attractive way
> > to present things.
> >
> > I tried drill explorer too.But didnt find that so good.
> > One thing to note, I am playing with files on Hadoop standalone mode in
> > UBUNTU.
> >
> > To make it appear more good looking, I started with QLIK SENSE .. but was
> > unable to connect it with hadoop file system. It only showed me HIVE
> FILES.
> > Then I downloaded TABLEAU Trial version ... but I am unable to get Hadoop
> > data here too...
> >
> > Please help me how to proceed. I have presentation on coming Monday.
> > Queries are quite ready .. I just need to show in visualization form
> > ........ using OPEN SOURCE applications only.
> >
> >
> > Guys please help me.
> >
> >
> >
> > On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <jo...@omernik.com> wrote:
> >
> > > AWESOME!
> > >
> > > I had just been in the process of writing up a long user story to ask
> for
> > > and support exactly this.   I modified it and included it here:
> > >
> > >
> > > To start out, I want to say how much I love the Drill project, and the
> > > potential it has. I've put this together based on my experiences and
> want
> > > to contribute a perspective as a user, not just put a bunch of
> critiques
> > in
> > > an email.  I hope it's all taken in that spirit.  Additional note, I
> > wrote
> > > this prior to seeing the Design Document share by Hsuan Yi Chu
> yesterday.
> > > If you are reading it, and think to yourself “that wording is odd…”
> > please
> > > consider it from the “I didn’t want to throw away the user story”
> > > perspective and the “I wrote it before the design doc” perspective.
> > >
> > >
> > >
> > > Additionally, I understand that some of what I am suggesting may not be
> > > easy from a development perspective.  I am just being upfront with my
> > > experience, so we can look to determine what can be done; I am not
> > looking
> > > for a silver bullet here, just looking for improvement.  Some may be as
> > > simple as better documentation, other suggestions may be harder to
> > > implement.  Either way, I thought a verbose user story might be useful
> to
> > > the community as a whole.
> > >
> > >
> > >
> > > John
> > >
> > >
> > >
> > > *User Story*
> > >
> > >
> > >
> > > As I have been working with Drill for data exploration, I came across
> > > multiple "things" that just were hard.  In dealing with some data,
> > > especially JSON data, it can be ugly, and scaled ugly is even worse!
> > >
> > >
> > >
> > > For this story, I am working with a JSON dump from MongoDB, and you
> would
> > > think it would be well structured, and for the most part it is.  There
> > are
> > > some application level mistakes that were made (I will go into that in
> a
> > > moment), but in general Drill handles this well.  So with this data
> set,
> > > there are a few main challenges I am seeing:
> > >
> > >
> > >
> > > 1.     When there is a field that has a float, and then a later record
> > has
> > > the number 0 in it (which Drill takes as a INT). This is a known
> problem
> > > and one that Drill has a solution for.
> > >
> > > 2.     When there is a field is of one type (a map) and then a later
> > record
> > > has a string in it.  No easy solution here.
> > >
> > > 3.     Select * where there is a json field with a . in the name. I
> won’t
> > > go into details here, but I feel this factors into data exploration,
> > > because it changes the ability to “stay in Drill” to explore their
> data (
> > > https://issues.apache.org/jira/browse/DRILL-3922)
> > >
> > > 4.     Error reporting challenges
> > >
> > >
> > >
> > >
> > >
> > > With the problem summary laid out, I wanted to walk through my process
> in
> > > working with this data, and where, if I were a user Drill could have
> been
> > > much more helpful to the process.
> > >
> > >
> > >
> > > Here is a description of the process I went through:
> > >
> > >
> > >
> > > 1.     Copy data into filesystem
> > >
> > > 2.     Use drill to “Select * from `path_to/dump.json` limit 1
> > >
> > > 3.     (I just want to see what it looks like!)
> > >
> > >
> > >
> > >
> > >
> > > Here I get this error:
> > >
> > >
> > >
> > > > select * from `path_to/ dump.json` limit 1;
> > >
> > > Error: DATA_READ ERROR: You tried to write a BigInt type when you are
> > using
> > > a ValueWriter of type NullableFloat8WriterImpl.
> > >
> > >
> > >
> > > File  /data/dev/path_to/dump.json
> > >
> > > Record  1
> > >
> > > Line  1
> > >
> > > Column  9054
> > >
> > > Field  entropy
> > >
> > > Fragment 0:0
> > >
> > >
> > >
> > > This isn’t incredibly helpful from a user perspective.  I.e. When I
> > Google
> > > around, I realize now that in the docs it talks about “Schema Changes”
> > and
> > > one possible item is use the setting below. However, examples of the
> data
> > > that was trying to be displayed (with it’s implied type) may help users
> > > grok what is happening.  At least in this case it showed me the field
> > name!
> > >
> > >
> > >
> > > ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
> > >
> > >
> > >
> > > This is a great example of a known failure case (the numbers are
> > > doubles, but someone stores a 0 as an INT) for which the devs have
> > > added a setting to let a user get through it, yet the error message
> > > could still be more helpful. In this case, showing two record
> > > (line) numbers with different types, the field values with their
> > > implied types, and perhaps a suggestion to use the setting would
> > > make it more intuitive for the user to stay in Drill, and stay in
> > > the data. As it was, I looked at the head of the file, saw the
> > > issue, and was able to proceed.
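As a sketch of what such an error could surface, a small Python pass over a newline-delimited JSON file can find the first two records whose numeric type for a field disagrees. This is illustrative only, not anything Drill provides; the file path and the `entropy` field name are stand-ins taken from the error above.

```python
import json

def first_type_conflict(path, field):
    """Report the first record holding `field` and the first later record
    whose numeric Python type for `field` differs -- roughly the two
    conflicting (record number, type, value) triples an error could show."""
    first = None  # (record_number, type_name, value) of first occurrence
    with open(path) as f:
        for num, line in enumerate(f, start=1):
            rec = json.loads(line)
            if field not in rec:
                continue
            tname = type(rec[field]).__name__  # e.g. 'int' vs 'float'
            if first is None:
                first = (num, tname, rec[field])
            elif tname != first[1]:
                return first, (num, tname, rec[field])
    return first, None
```

Surfacing both record numbers in the DATA_READ error itself is exactly the kind of context this walk computes.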
> > >
> > >
> > >
> > > Also, as a corollary, the user documentation does not show this
> > > error in connection with the schema change problem. It would be a
> > > great place to state, “if you see an error that looks like X, this
> > > is what is happening and what you can do about it.”
> > >
> > >
> > >
> > >
> > >
> > > *Side note on documentation*
> > >
> > > We should look to make the documentation role based. In this case,
> > > the documentation says to use “ALTER SYSTEM”. I would argue, and I
> > > am guessing others would concur, that for this use case “ALTER
> > > SESSION” may be the better suggestion, since the alteration
> > > addresses the specific task of loading/querying one data set and is
> > > likely done by a user of the system.
> > >
> > >
> > >
> > > If a user is doing self-serve data work in an enterprise
> > > environment, they may not have the privileges to use ALTER SYSTEM;
> > > they get an error and may be confused about how to proceed. In
> > > addition, an ALTER SYSTEM issued by a user who doesn’t understand
> > > what they are changing, yet has the rights to change it, may
> > > introduce future data problems they didn’t expect. I like that the
> > > default is the more restrictive method, because it makes people be
> > > explicit about their data, but the documentation should be equally
> > > explicit about the ramifications of a system-wide change.
> > >
> > >
> > >
> > >
> > >
> > > *Back to the story*
> > >
> > > Ok, so now I will do an ALTER SESSION SET on the
> > > read_numbers_as_double setting:
> > >
> > > ALTER SESSION SET `store.json.read_numbers_as_double` = true;
> > >
> > > I run the query again.
> > >
> > >
> > >
> > > > select * from `path_to/dump.json` limit 1;
> > >
> > > Error: DATA_READ ERROR: Error parsing JSON - You tried to write a
> VarChar
> > > type when you are using a ValueWriter of type SingleMapWriter.
> > >
> > >
> > >
> > > File  /data/dev/path_to/dump.json
> > >
> > > Record  4009
> > >
> > > Fragment 0:0
> > >
> > >
> > >
> > > Another error. But what does this one mean? Now that I have been
> > > living in the docs and on the Drill user list, and because it’s
> > > similar to the schema change issue, I realize that is what we are
> > > looking at here: instead of double vs. int, we have one field that
> > > is a map most of the time and a string in some cases.
> > >
> > >
> > >
> > > But this doesn’t really help me as a user. Drill doesn’t offer any
> > > options to troubleshoot this. The file is 500 MB of dense, nested
> > > JSON data with 51k records. My solution? I took the record number
> > > and went to my NFS-mounted clustered file system (thank goodness I
> > > had MapR’s NFS access here; I am not sure how I would have reached
> > > the data with POSIX tools otherwise).
> > >
> > >
> > >
> > > My command: $ head -n 4009 dump.json | tail -n 1
> > >
> > >
> > >
> > > That (I hoped) showed me the record in question. Note that the
> > > error from Drill didn’t tell me which field was at fault, so I had
> > > to visually align things to find it. However, I was able to spot
> > > the difference and work with the dev to understand why it happened.
> > > I removed those records, and things worked correctly.
> > >
> > >
> > >
> > > Could there have been a way to identify that within Drill? My
> > > solution was a Python script that read through the file and
> > > discarded the records that were not a map. That works on 500 MB,
> > > but what about 500 GB? I guess a Spark job could clean the data…
> > > but could Drill be given some tools to help with this situation?
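The cleanup script was along these lines (a sketch; the `sig` field name is hypothetical). Because it streams one line at a time, memory stays flat even on large files, though a pass over 500 GB is still a long pass:

```python
import json

def drop_non_map_records(src, dst, field):
    """Copy records whose `field` is a JSON object (map) from src to dst,
    dropping the string-typed outliers that trip the map/varchar schema
    conflict. Returns (kept, dropped) counts."""
    kept = dropped = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            rec = json.loads(line)
            if isinstance(rec.get(field), dict):
                fout.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```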
> > >
> > >
> > >
> > > For example, the first thing I asked was: what field is at issue?
> > > I had no way to see that. I had to use other tools to look at the
> > > data so I could understand the problem, and once I understood it, I
> > > had to use Python to produce data that was queryable.
> > >
> > >
> > >
> > > Based on the design document Hsuan Yi Chu just posted to the
> > > mailing list, my post at this point is simply a user story in
> > > support of that document. To summarize the points I’d like to see
> > > included in it (from a user perspective, without claiming to
> > > understand “how or why”):
> > >
> > >
> > >
> > > *1.     **Error messages that are more verbose in explaining the
> > > problem*
> > >
> > > a.     Filename, row number, and column number or name
> > >
> > > b.     An option to output the “offending row”
> > >
> > > c.     Showing the data that is causing the error WITH the type
> > > Drill inferred
> > >
> > > d.     If there are options to help work through dirty data, the
> > > error message could include them: “Data was a double, then Drill
> > > found this data: 0, which was an int, in file X at row 24 in column
> > > ‘myfloatingdata’; consider using store.json.read_numbers_as_double
> > > to address the issue.”
> > >
> > > 2.     *A way to determine how common this exception is*
> > >
> > > a.     If I am playing with a messy data set and this error
> > > happens, does it happen on 1 record? 2? 5000? Knowing that would:
> > >
> > >                                                i.     Help users
> > > understand how Drill is seeing that particular column
> > >
> > >                                              ii.     Inform
> > > decisions about excluding data rather than just removing it. What
> > > if the first 10 records were errors, and you excluded the remaining
> > > 10 million because they were correct yet different from the first
> > > 10?
> > >
> > > b.     Perhaps there could be a “stats” function that only works if
> > > it’s the only selected item, or if everything selected is a stats
> > > function?
> > >
> > >                                                i.     Select
> > > type_stats(fieldname) from data
> > >
> > >                                              ii.     (which
> > > wouldn’t error on different types)
> > >
> > > 3.     *An ability to return null on a field if there is an error
> > > or if it is non-castable to type X, especially in a view, perhaps
> > > via a function*
> > >
> > > a.     It lets users avoid reparsing data outside Drill
> > >
> > > b.     It lets them load data into a sane format (one-time
> > > loads/ETL to clean data)
> > >
> > > c.     It is not a system- or session-wide exception
> > >
> > >                                                i.     I think this
> > > is important because I may have a field where I want the numbers
> > > read as double, but another field in the same dataset where I
> > > don’t. A SYSTEM or SESSION level variable takes away that
> > > granularity.
> > >
> > > d.     Select field1, CASTORNULL(field2, int) as field2,
> > > CASTORNULL(field3, double) as field3 from ugly_data
> > >
> > > e.     That’s an example in the select, but I could also see a
> > > where clause:
> > >
> > > f.      Select field1, field2, field3 from ugly_data where
> > > ISTYPE(field2, int) and ISTYPE(field3, double)
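Neither CASTORNULL nor ISTYPE exists in Drill; both are proposed names. Their intended per-value semantics can be pinned down in a short Python sketch:

```python
def cast_or_null(value, target):
    """Proposed CASTORNULL semantics: the cast result, or None (SQL NULL)
    when the value cannot be cast, instead of failing the whole query."""
    try:
        return target(value)
    except (TypeError, ValueError):
        return None

def is_type(value, target):
    """Proposed ISTYPE semantics: True when the value casts cleanly,
    usable as a per-row filter in a where clause."""
    return cast_or_null(value, target) is not None
```

The point of the proposal is exactly this granularity: the null-on-failure behavior attaches to one field in one query, not to the whole session or system.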
> > >
> > > 4.     *Updating of the documentation on ALTER SESSION vs ALTER
> > > SYSTEM with an eye to the majority use case of the documented
> > > feature*
> > >
> > > a.     For data loads, the documentation uses ALTER SYSTEM, and
> > > that’s problematic because:
> > >
> > >                                                i.     Not all users
> > > have the privileges to issue an ALTER SYSTEM. A new user trying to
> > > figure things out may not realize they can just ALTER SESSION after
> > > getting an ALTER SYSTEM error.
> > >
> > >                                              ii.     ALTER SYSTEM
> > > on data loading items, especially ones that make Drill’s data
> > > interpretation more permissive, can lead to unintended consequences
> > > later. An admin, who may be a good systems admin, helping a data
> > > user troubleshoot an error may issue an ALTER SYSTEM without
> > > realizing it changes all future data imports.
> > >
> > > b.     I found a few such cases, but I would suggest a thorough
> > > review of the use cases throughout the documentation; in areas
> > > where either command could apply, add a small paragraph indicating
> > > the ramifications of each.
> > >
> > > *5.     **A philosophy within the Drill community to “Stay in
> > > Drill” for data exploration*
> > >
> > > a.     This is obviously not as much a development thing as a
> > > mindset. If someone says “I tried to do X and I got an error,” and
> > > the community’s response is “look through your data and do Z to it
> > > so Drill can read it,” then we should reconsider that scenario and
> > > try to provide an option within Drill to intuitively handle the
> > > edge case. This is difficult.
> > >
> > > b.     There are cases even in the documentation where this comes
> > > up: https://drill.apache.org/docs/json-data-model/ talks about
> > > arrays at the root level and about reading some empty arrays. In
> > > these cases, we have to leave Drill to fix the problem. That works
> > > on small data, but may not work on large or wide data. Consider the
> > > array-at-root-level limitation: what if some process outside the
> > > user’s control produces 1000 100 MB JSON files and we want to read
> > > them? To fix it, we have to rewrite those files. That is a lot of
> > > work, either manual or automated.
> > >
> > > c.     Once again, I know this isn’t easy, but if at all possible
> > > we shouldn’t answer questions about how to do something with “fix
> > > this outside of Drill so Drill can read your data.”
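For the array-at-root-level case specifically, the out-of-Drill fix is mechanical, which is exactly the frustration. A sketch of it, with the memory caveat that makes it scale poorly (`json.load` pulls the whole file into RAM; a streaming parser would be needed for the 1000-files-of-100-MB scenario):

```python
import json

def unwrap_root_array(src, dst):
    """Rewrite a file whose root is a JSON array ([{...}, {...}]) as
    newline-delimited JSON objects, the shape Drill queries directly.
    json.load reads the entire file into memory, so this only works
    while each file fits in RAM -- the scaling problem noted above."""
    with open(src) as fin:
        records = json.load(fin)
    with open(dst, "w") as fout:
        for rec in records:
            fout.write(json.dumps(rec) + "\n")
    return len(records)
```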
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > I hope this story helps support the design document presented. I
> > > am happy to participate in more discussion around these topics, as
> > > I have enjoyed digging into the internals of Drill.
> > >
> > >
> > >
> > > John Omernik
> > >
> >
> >
> >
> > --
> > *Name: Ganesh Semalty*
> > *Location: Gurgaon,Haryana(India)*
> > *Email Id: g4ganeshsemalty@gmail.com <g4...@gmail.com>*
> >
> >
> >
> > *Please consider the environment before printing this e-mail - SAVE
> TREE.*
> >
>


