Posted to dev@drill.apache.org by "Siprell, Stefan" <st...@exxeta.de> on 2013/01/10 14:45:33 UTC

Introduction

Hi all,
I am working for an IT consulting agency in Germany. One of the goals of our team for 2013 is active (as in giving) participation in the open source community and offering our customers cutting-edge analytical tools for large to huge databases. You guys hit the spot!

I would like to start offering my personal help (volunteer work for now, later I could pitch in a day or two per week perhaps) in any role which would help. I am a somewhat strong enterprise Java developer, can deal sufficiently well with HTML5 frontends, know most things about build environments and testing, and should be able to do some design or documentation.

Is there anything I can do?

Stefan

Re: Introduction

Posted by Jacques Nadeau <ja...@gmail.com>.

On Sun, Jan 20, 2013 at 11:39 AM, Siprell, Stefan
<st...@exxeta.de> wrote:

> Hi Jacques,
> maybe I am overconfident, but I think we have a great discussion going on
> here. Should we continue it on the developer mailing list, or do we already
> have a policy on this?
>

As Ted said, this is a great place for this discussion.  I really
appreciate you pushing the boundaries and am personally finding it quite
useful.



> Either way I am enthusiastic to give my response. Purposely I drifted a
> bit away from the regular SQL syntax to provoke some creative discussion. I
> thought it would be easiest to think about the query language at the very
> beginning of the project before we settle on SQL. I will give my feedback
> on the suggested queries at the end of the mail.
>

Good idea.  Remember, we're not going to settle on one particular query
language.  Our goal is to just make the Logical Plan expressive enough to
support these concepts.  I really want super-SQL to be among the first
languages but the more the better.



> Using a select * on hierarchical data should absolutely return a deep copy
> of hierarchical data - granted. But I think that using SQL simply to
> prune branches from the hierarchical data is not desired, so you basically
> have to offer some kind of document translation sooner or later. Document
> translation languages tend to have a very verbose syntax and require lots
> of cumbersome coding - just look at XSLT. Hence my suggestion to make flat
> output a first-class citizen, as the result and language constructs are
> familiar to its users. One can always add XQuery or something similar to
> deepen the data again. I don't want to overemphasize the ease-of-use
> issue, but the prime reason I am interested in Drill is that it allows
> real-time processing of queries on large datasets. I am thinking of
> developers sitting in front of consoles, typing queries and firing them
> away, so I was really trying to define a compact and readable language.
> Remembering my XSLT times: I really was not all that efficient fiddling
> with this.
>

Agreed.  From my perspective, we should try to add support for a reasonable
subset of use cases without destroying usability.


> I am not skeptical at all, that SQL will achieve the desired goals. I am
> just wondering if there is anything better :-)
>

There is room for both :)


> Feedback on the first query:
> Looking at the sub select recordsPerRevision, I wonder how the query
> builder would know that we want one entry per page. Is this calculated by
> the fact that in the path mediawiki.page.title and mediawiki.page.id both
> have mediawiki.page as the last non scalar path entry? What would happen,
> if we need two columns with different path depths? I would at least suggest
> to model this more explicitly using:
>
>  select
>    title as pageTitle,
>    id as pageId,
>    flatten(revision) as rev
>  from mediawiki.page
>
>
My thoughts were that a straight dot path (without repeated item indicators
'[]') would always be one per record.  As such, the concept of flatten is
clear here.  So explicit or implicit seems okay.  (For example,
mediawiki.page.revision.timestamp would actually return a null value.)
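
To make the flatten reading concrete, here is a minimal Python sketch, not
Drill code. The record layout and the `flatten` helper are assumptions made
up for illustration: the scalar fields of a page are copied onto one output
row per revision.

```python
def flatten(record, repeated_field):
    """Yield one row per element of record[repeated_field].

    Fields next to the repeated field are copied onto every row.
    This is an illustrative helper, not a Drill or BigQuery API.
    """
    parent = {k: v for k, v in record.items() if k != repeated_field}
    for child in record.get(repeated_field, []):
        row = dict(parent)
        row[repeated_field] = child
        yield row


# Hypothetical page record with a repeated "revision" field.
page = {
    "title": "Drill",
    "id": 7,
    "revision": [
        {"timestamp": 10, "bytes": 120},
        {"timestamp": 20, "bytes": 80},
    ],
}

rows = list(flatten(page, "revision"))
```

Under this reading a page with an empty revision list produces no rows at
all, which is one of the behaviours the query language would have to pin down.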

> The multi-path entry in the from clause would clearly set the reference
> point for further references, and the user and query builder would
> immediately recognize how many results we expect for what. Can the query
> compiler cope with this heavy-duty flatten operation in a select clause? We
> are basically running a join on flattened children of a node, and we
> describe this nonchalantly in the select clause and reference it in the
> order by statement, possibly even in an aggregation function. I think this
> would be great, but will it be simple to implement? Google seems to see
> this in the from and not in the select clause as well.
>

Time will tell on implementability.  It kind of reminds me of a distinct
clause in traditional SQL. The syntactic simplicity belies the execution
complexity.


> I really like the within statement, but I am not sure if this works as
> expected. The where statement should only show revisions which occurred at
> the time given, then we want to show information on the pages which
> contained these revisions, something like a right join. Within seems to
> work as a left join, showing us pages (parents) which do not have any
> children (flatten) as well. A minor correction for the within clause: it is
> not based on RECORD but on mediawiki.page.revision. As I mentioned, it
> might be confusing whether a right, left, inner, or outer join is being
> done, and on which basis. At least for a dummy like me :-)
>

Good point. Reminds me of the difference between where and having.  Not
sure what the right solution is with regard to a contextually correct
filtering clause.  I'm kind of driven back towards a subquery solution.
The optimizer should be able to smash them back together on execution even
if they were built as a logical subquery.
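
One way to picture the subquery form is the sketch below, written in plain
Python rather than any Drill syntax. The function name, the sample fields,
and the inclusive bounds (as with SQL BETWEEN) are assumptions for
illustration: the filter runs on the flattened per-revision rows first, and
the aggregation only sees what survives.

```python
from collections import defaultdict


def total_changes(pages, start, end):
    """Sum revision bytes per page for revisions with start <= timestamp <= end."""
    # Inner "subquery": one row per (page, revision), filtered on timestamp.
    per_revision = [
        (page["title"], page["id"], rev["bytes"])
        for page in pages
        for rev in page.get("revision", [])
        if start <= rev["timestamp"] <= end
    ]
    # Outer query: group by page, sum, then order by the total descending.
    totals = defaultdict(int)
    for title, page_id, nbytes in per_revision:
        totals[(title, page_id)] += nbytes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample data.
pages = [
    {"title": "A", "id": 1, "revision": [
        {"timestamp": 5, "bytes": 10},
        {"timestamp": 15, "bytes": 30},
    ]},
    {"title": "B", "id": 2, "revision": [{"timestamp": 12, "bytes": 50}]},
    {"title": "C", "id": 3, "revision": []},
]

result = total_changes(pages, 10, 20)
```

Written this way there is no ambiguity about ordering: pages whose revisions
all fall outside the window never reach the group by.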


> I am again confused on the partitioned aggregation. Is the where
> statement executed logically before or after the aggregation? Same thing on
> the last suggested query. I can very clearly see what you are trying to
> achieve, but I am only human. Without having ever written a query planner,
> I am uncertain if the machine can resolve ambiguities on how the from,
> aggregation, select and join operations work together.
>

Yeah, same problem here.  This becomes even more problematic when we are
unsure of the schema until we actually start running the query (e.g.
running a first-time query against JSON).


>
> If we want to stick to a SQL dialect, then we might as well copy the
> BigQuery syntax from Google. If the Google API has some shortcoming, we
> should perhaps address this and explicitly name the issues. Perhaps I
> should map the XML to an RDBMS and write the same queries in the appropriate
> SQL. But I would really prefer to write the query in the Drill Logical
> Query Plan to be more precise. Do we have some examples or a complete
> definition I can get my hands on?
>

I feel like BigQuery maybe strayed away from the spec a little with the
windowing/partitioning stuff.  Otherwise, BigQuery is not far off.  If you
want to write directly to Logical Plan syntax, that would be very helpful.
The syntax document is the best place to start [1].  I made an example query
previously at [2].  The logical reference interpreter is alpha at [3].



[1]
https://docs.google.com/a/maprtech.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
[2] https://github.com/jacques-n/incubator-drill/wiki/SQL-Example
[3] https://github.com/jacques-n/incubator-drill

>
> Stefan
>
>
>
>
>
>
>
> On 20.01.2013, at 19:30, Jacques Nadeau <ja...@gmail.com> wrote:
>
> > I spent a little time looking at your first query.  I think, for all the
> > queries, writing a little more description of the query goals would be
> > helpful to ensure that I'm not misinterpreting your objective.
> >
> > select rev.::parent.title, rev.::parent.id, sum(rev.text.bytes)
> > from mediawiki.page.revision as rev
> > where rev.timestamp.between(?, ?)
> > group by rev.::parent;
> >
> > If I were trying to make it more SQL'y, I'd probably go with something
> like:
> >
> > select
> >  pageTitle,
> >  pageId,
> >  sum(rev.bytes) as totalChanges
> > from (
> >  select
> >    mediawiki.page.title as pageTitle,
> >    mediawiki.page.id as pageId,
> >    flatten(mediawiki.page.revision) as rev
> >  from mediawiki
> >  where rev.timestamp between ? and ?
> > ) as recordPerRevision
> > group by pageTitle, pageId
> > order by totalChanges desc
> >
> > A cleaner alternative would be providing the more complicated WITHIN
> syntax
> > as BigQuery does:
> >
> > select
> >    mediawiki.page.title as pageTitle,
> >    mediawiki.page.id as pageId,
> >    sum(mediawiki.page.revision.bytes) as totalChanges within RECORD
> >  from mediawiki
> >  where rev.timestamp between ? and ?
> >
> >
> > Or extending the SQL2003 windowing functions such as that partitioning
> > within a single record is possible and then makes the aggregating
> functions
> > use relative references.
> >
> > select
> >    mediawiki.page.title as pageTitle,
> >    mediawiki.page.id as pageId,
> >    sum(bytes) as totalChanges OVER(PARTITION BY mediawiki.page.revision)
> >  where rev.timestamp between ? and ?
> >
> > Or providing the simple approach, providing a specialized 'scalar'
> > function: ARRAY_SUM(array_node_to_iterate_over,
> > expression_to_evaluate_on_each_iterated_value) function:
> >
> > select
> >    mediawiki.page.title as pageTitle,
> >    mediawiki.page.id as pageId,
> >    ARRAY_SUM(mediawiki.page.revision, bytes) as totalChanges
> >  where rev.timestamp between ? and ?
> >
> >
> >>
> >> I also understood drill was more of an analytical platform. So my
> >> understanding is that we want to access hierarchical data, but we do not
> >> want to generate any. Besides trying to run reports, charts or tables
> >> (typical client applications) on hierarchical data is a mess, as the
> >> toolset simply doesn't support it. Out of this reason, I would focus on
> >> generating flat result for the time being.
> >>
> >>
> > I think this is a really great point. It made me question some of the
> > assumptions I had been operating on.  That being said, I'd like to hold
> off
> > on trimming that tree entirely for the time being.  I'm concerned doing
> so
> > would substantially reduce the effectiveness of ever using nested datasets with
> it.
> > For example, if I do select * from a nested dataset, I really want to see
> > hierarchical data returned.  In the case of building up a single query
> on a
> > number of sub queries, I can see many useful situations where the
> > intermediate queries still maintain hierarchical datasets, even if the
> > final goal output might be a flat data structure for analytical tool use.
> >
> >
> >
> >> If desired I can start writing an ANTLR grammar on the stuff I am
> working
> >> on, to make the output more robust. I had a look at the SQL parser you
> guys
> >> mentioned, but I don't think this would work on my kind of queries, as
> it
> >> drastically expands SQL 2003. All we want to do is to map the AST to
> your
> >> logic plan? I think this can be done quite easily just using ANTLR and
> some
> >> Java classes.
> >>
> >
> > If you want to build a simple query language that generates logical
> plans,
> > that would be interesting.  Given my rewrites, are you still skeptical of
> > minimally extending SQL 2003?
> >
> > Jacques
> >
> >
> >>
> >> Stefan
> >>
> >> On 20.01.2013, at 00:56, Jacques Nadeau <ja...@gmail.com>
> wrote:
> >>
> >>> Many of these haven't been finalized since we're still working on code.
> >>> That being said, let me share what my thoughts have been to date.
> >>>
> >>>> SQL Row maps to a drill record?
> >>> Correct
> >>>
> >>>> And drill would not have a flat sibling structure of nodes, a.k.a.
> >> columns
> >>> but hierarchical nodes?
> >>> Correct.  My general thinking is that a record is a DataValue.
> >>> A DataValue can be one of three major types: a map (string:DataValue),
> an
> >>> ordered list (DataValues[]), or a scalar DataValue.  Most commonly, the
> >>> first DataValue in a record would be a map.  In the case of SQL/flat
> data
> >>> (e.g. CSV), this map would only contain scalar values.
> >>>
> >>>> Will drill access the contents of a record in a stream or document
> >> manner?
> >>> How large may a record be?
> >>> For the first version of Drill, I was thinking that a record must fit
> >>> entirely in memory.  Functions can interact with an entire record as
> they
> >>> choose.
> >>>
> >>>> Can I use XPath-like functions to access nodes?
> >>> Generally, we hope so.  'Like' being the operative word here.  The path
> >>> expressions that we're thinking of using are substantially simpler than
> >> the
> >>> expressiveness of xpath.  Ultimately, I could see people creating a
> >> parser
> >>> which takes in xquerys and converts them to Drill logical plans.  That
> >>> being said, our goal is more for analytical queries than document
> >>> transformations.
> >>>
> >>>> All of the google bigquery Cook Book Examples seem to generate flat
> >>> Output, is this a limitation?
> >>> In Drill, we don't plan to limit to flat output.  For v1, we're looking
> >> at
> >>> supporting hierarchical expressions in sql 'as' aliases.  We're also
> >>> looking at supporting selections at any level of hierarchy, not just
> the
> >>> leaf level.  We then combine these with a concept of collision behavior
> >>> control so that you can control how to merge multiple nested out values
> >>> into a single output tree.  These will allow one to build a nested
> output
> >>> object.  These are preliminary thoughts.  We need to write more and
> >> discuss
> >>> more.
> >>>
> >>> One thing to remember is that one of Drill's goals is to be flexible.
> >>> Ultimately, different query languages may support different subsets of
> >>> operations and no one query language may include all operators.
> >>>
> >>> Hope that makes sense.
> >>>
> >>> Jacques
> >>>
> >>> On Sat, Jan 19, 2013 at 3:11 PM, Siprell, Stefan
> >>> <st...@exxeta.de> wrote:
> >>>
> >>>> Aaaah studying the Big query docs helped. I may assume, that a SQL Row
> >>>> maps to a drill record? And drill would not have a flat sibling
> >> structure
> >>>> of nodes, a.k.a. columns but hierarchical nodes?   All of the google
> >>>> bigquery Cook Book Examples seem to generate flat Output, is this a
> >>>> limitation? If not how would i generate my hierarchical Output Model,
> >>>> without using a groovy builder or xquery :-)
> >>>>
> >>>>
> >>>> Stefan
> >>>>
> >>>> Von meinem iPad gesendet
> >>>>
> >>>> Am 20.01.2013 um 00:01 schrieb "Jacques Nadeau" <
> >> jacques.drill@gmail.com>:
> >>>>
> >>>>> Fair enough.  Starting with big query syntax or SQL 2003 and flat
> data
> >>>>> structures will work fine.  I'll try to write something meaningful up
> >>>> about
> >>>>> sql and nested data structures.
> >>>>>
> >>>>> Jacques
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Sat, Jan 19, 2013 at 2:54 PM, Siprell, Stefan
> >>>>> <st...@exxeta.de> wrote:
> >>>>>
> >>>>>> Should I not just use this here as a reference?
> >>>>>>
> >>>>>> https://developers.google.com/bigquery/docs/query-reference
> >>>>>>
> >>>>>> I am a bit stumped to be honest. I am trying to think how to use SQL
> >>>>>> efficiently on Nested Data structures.
> >>>>>>
> >>>>>> Von meinem iPad gesendet
> >>>>>>
> >>>>>> Am 19.01.2013 um 19:51 schrieb "Jacques Nadeau" <
> >>>> jacques.drill@gmail.com
> >>>>>> <ma...@gmail.com>>:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> * I drew a UML diagram. I saw that there is some Gliffy support in
> >>>>>> Confluence, but the free account is pretty much useless. I used omni
> >>>>>> graffle to draw the diagram, but this is payware on the mac - is
> there
> >>>> some
> >>>>>> usable freeware alternative? Don't mention tigris :-)
> >>>>>>
> >>>>>>
> >>>>>> I don't have any suggestions on this.
> >>>>>>
> >>>>>>
> >>>>>> * I have some ideas on the queries, but I am not sure how I should
> >>>> specify
> >>>>>> them? Should I use pseudo SQL? Prose? I saw the syntax document on
> the
> >>>>>> server, is it mature enough that I attempt to use its syntax? Is
> >> there
> >>>> a
> >>>>>> BNF or better ANTLR grammar I can use to check my syntax? Should I
> >> draw
> >>>> one
> >>>>>> up while I am at it?
> >>>>>>
> >>>>>>
> >>>>>> I suggest you target SQL2003 (including subqueries).  We're looking
> at
> >>>> how
> >>>>>> to use Optiq's SQL parser for Drill.  Our goal is to stay as close
> as
> >>>>>> possible to that spec but add the following extensions:
> >>>>>> - Add flatten operator similar to BigQuery syntax
> >>>>>> - Support use of selection and output identifiers using
> >> dotted/bracketed
> >>>>>> notation.  E.g. "select person.children[0].age as
> >>>>>> output.profile.firstChildAge"
> >>>>>> - Support new functions that can accept nested values including
> >>>> collections
> >>>>>> and maps.  For example "select ARRAY_LENGTH(person.children)".
> >>>>>>
> >>>>>> Once you have some sql examples, the next goal would be to manually
> >>>>>> translate those into Logical Plan syntax.  This syntax is still
> >>>> maturing so
> >>>>>> I'd take it to the SQL stage first.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Stefan
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 19.01.2013, at 02:05, Jacques Nadeau <jacques.drill@gmail.com
> >>>> <mailto:
> >>>>>> jacques.drill@gmail.com>> wrote:
> >>>>>>
> >>>>>> The wiki is up.  Michael and Stefan, it would be great if you
> started
> >>>>>> putting your use case thoughts there.
> >>>>>>
> >>>>>> Jacques
> >>>>>>
> >>>>>> On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <ted.dunning@gmail.com
> >>>>>> <ma...@gmail.com>>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Ahh... yes.  That wiki.  I will ping infra again.
> >>>>>>
> >>>>>> (I was attaching your comment to the wikipedia use case and had
> >> confused
> >>>>>> myself)
> >>>>>>
> >>>>>> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
> >>>>>> michael.hausenblas@gmail.com<ma...@gmail.com>>
> >>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> What do you need from me?
> >>>>>>
> >>>>>> Maybe I've overlooked something in which case I apologize - was
> >>>>>> wondering
> >>>>>> if the public Wiki for Drill is available where Stefan, I and others
> >>>>>> can
> >>>>>> write up the UC and queries.
> >>>>>>
> >>>>>> Cheers,
> >>>>>>            Michael
> >>>>>>
> >>>>>> --
> >>>>>> Michael Hausenblas
> >>>>>> Ireland, Europe
> >>>>>> http://mhausenblas.info/
> >>>>>>
> >>>>>> On 13 Jan 2013, at 14:20, Ted Dunning <ted.dunning@gmail.com
> <mailto:
> >>>>>> ted.dunning@gmail.com>> wrote:
> >>>>>>
> >>>>>> What do you need from me?
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
> >>>>>> michael.hausenblas@gmail.com<ma...@gmail.com>>
> >>>> wrote:
> >>>>>>
> >>>>>> as soon as we hear back from Ted re the Wiki we work there.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>
> >>
>
>

Re: Introduction

Posted by Ted Dunning <te...@gmail.com>.

The dev list is a perfect place for the back and forth part of this
conversation.  Could I encourage both of you to record some of your
queries and thoughts on the wiki itself as they become more well-formed?

Even the simple query that I have retained down below would be interesting
as a first look at Drill.

On Sun, Jan 20, 2013 at 2:39 PM, Siprell, Stefan
<st...@exxeta.de> wrote:

> Hi Jacques,
> maybe I am overconfident, but I think we have a great discussion going on
> here. Should we continue it on the developer mailing list, or do we already
> have a policy on this?
> ...
> Looking at the sub select recordsPerRevision, I wonder how the query
> builder would know that we want one entry per page. Is this calculated by
> the fact that in the path mediawiki.page.title and mediawiki.page.id both
> have mediawiki.page as the last non scalar path entry? What would happen,
> if we need two columns with different path depths? I would at least suggest
> to model this more explicitly using:
>
>  select
>    title as pageTitle,
>    id as pageId,
>    flatten(revision) as rev
>  from mediawiki.page
>

Re: Introduction

Posted by "Siprell, Stefan" <st...@exxeta.de>.

Hi Jacques,
maybe I am overconfident, but I think we have a great discussion going on here. Should we continue it on the developer mailing list, or do we already have a policy on this?
Either way I am enthusiastic to give my response. Purposely I drifted a bit away from the regular SQL syntax to provoke some creative discussion. I thought it would be easiest to think about the query language at the very beginning of the project before we settle on SQL. I will give my feedback on the suggested queries at the end of the mail.

Using a select * on hierarchical data should absolutely return a deep copy of hierarchical data - granted. But I think that using SQL simply to prune branches from the hierarchical data is not desired, so you basically have to offer some kind of document translation sooner or later. Document translation languages tend to have a very verbose syntax and require lots of cumbersome coding - just look at XSLT. Hence my suggestion to make flat output a first-class citizen, as the result and language constructs are familiar to its users. One can always add XQuery or something similar to deepen the data again. I don't want to overemphasize the ease-of-use issue, but the prime reason I am interested in Drill is that it allows real-time processing of queries on large datasets. I am thinking of developers sitting in front of consoles, typing queries and firing them away, so I was really trying to define a compact and readable language. Remembering my XSLT times: I really was not all that efficient fiddling with this.

I am not skeptical at all, that SQL will achieve the desired goals. I am just wondering if there is anything better :-)

Feedback on the first query:
Looking at the sub select recordsPerRevision, I wonder how the query builder would know that we want one entry per page. Is this calculated by the fact that in the path mediawiki.page.title and mediawiki.page.id both have mediawiki.page as the last non scalar path entry? What would happen, if we need two columns with different path depths? I would at least suggest to model this more explicitly using:

 select
   title as pageTitle,
   id as pageId,
   flatten(revision) as rev
 from mediawiki.page

The multi-path entry in the from clause would clearly set the reference point for further references, and the user and query builder would immediately recognize how many results we expect for what. Can the query compiler cope with this heavy-duty flatten operation in a select clause? We are basically running a join on flattened children of a node, and we describe this nonchalantly in the select clause and reference it in the order by statement, possibly even in an aggregation function. I think this would be great, but will it be simple to implement? Google seems to see this in the from and not in the select clause as well.

I really like the within statement, but I am not sure if this works as expected. The where statement should only show revisions which occurred at the time given, then we want to show information on the pages which contained these revisions, something like a right join. Within seems to work as a left join, showing us pages (parents) which do not have any children (flatten) as well. A minor correction for the within clause: it is not based on RECORD but on mediawiki.page.revision. As I mentioned, it might be confusing whether a right, left, inner, or outer join is being done, and on which basis. At least for a dummy like me :-)
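
The two readings can be contrasted with a small Python sketch. The helper names and the sample data are hypothetical, and neither function is claimed to be what BigQuery or Drill actually does; the point is only that the same timestamp filter gives different page lists depending on the join-like behaviour.

```python
def matching_bytes(page, start, end):
    """Bytes of the revisions whose timestamp lies in [start, end]."""
    return [
        rev["bytes"]
        for rev in page.get("revision", [])
        if start <= rev["timestamp"] <= end
    ]


def totals_inner(pages, start, end):
    """First reading: pages without a matching revision disappear."""
    result = {}
    for page in pages:
        hits = matching_bytes(page, start, end)
        if hits:
            result[page["title"]] = sum(hits)
    return result


def totals_left(pages, start, end):
    """Second reading: every page is kept, with None when nothing matched."""
    result = {}
    for page in pages:
        hits = matching_bytes(page, start, end)
        result[page["title"]] = sum(hits) if hits else None
    return result


# Hypothetical sample data: only page "A" has a revision inside the window.
pages = [
    {"title": "A", "revision": [{"timestamp": 15, "bytes": 30}]},
    {"title": "B", "revision": [{"timestamp": 99, "bytes": 50}]},
]

inner = totals_inner(pages, 10, 20)
left = totals_left(pages, 10, 20)
```

With this data the first reading returns only page A, while the second also returns page B with an empty total, which is exactly the difference I would like the syntax to make visible.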

I am again confused on the partitioned aggregation. Is the where statement executed logically before or after the aggregation? Same thing on the last suggested query. I can very clearly see what you are trying to achieve, but I am only human. Without having ever written a query planner, I am uncertain if the machine can resolve ambiguities on how the from, aggregation, select and join operations work together.

If we want to stick to a SQL dialect, then we might as well copy the BigQuery syntax from Google. If the Google API has some shortcoming, we should perhaps address this and explicitly name the issues. Perhaps I should map the XML to an RDBMS and write the same queries in the appropriate SQL. But I would really prefer to write the query in the Drill Logical Query Plan to be more precise. Do we have some examples or a complete definition I can get my hands on?

Stefan







On 20.01.2013, at 19:30, Jacques Nadeau <ja...@gmail.com> wrote:

> I spent a little time looking at your first query.  I think, for all the
> queries, writing a little more description of the query goals would be
> helpful to ensure that I'm not misinterpreting your objective.
>
> select rev.::parent.title, rev.::parent.id, sum(rev.text.bytes)
> from mediawiki.page.revision as rev
> where rev.timestamp.between(?, ?)
> group by rev.::parent;
>
> If I were trying to make it more SQL'y, I'd probably go with something like:
>
> select
>  pageTitle,
>  pageId,
>  sum(rev.bytes) as totalChanges
> from (
>  select
>    mediawiki.page.title as pageTitle,
>    mediawiki.page.id as pageId,
>    flatten(mediawiki.page.revision) as rev
>  from mediawiki
>  where rev.timestamp between ? and ?
> ) as recordPerRevision
> group by pageTitle, pageId
> order by totalChanges desc
>
> A cleaner alternative would be providing the more complicated WITHIN syntax
> as BigQuery does:
>
> select
>    mediawiki.page.title as pageTitle,
>    mediawiki.page.id as pageId,
>    sum(mediawiki.page.revision.bytes) as totalChanges within RECORD
>  from mediawiki
>  where rev.timestamp between ? and ?
>
>
> Or extending the SQL2003 windowing functions such as that partitioning
> within a single record is possible and then makes the aggregating functions
> use relative references.
>
> select
>    mediawiki.page.title as pageTitle,
>    mediawiki.page.id as pageId,
>    sum(bytes) as totalChanges OVER(PARTITION BY mediawiki.page.revision)
>  where rev.timestamp between ? and ?
>
> Or providing the simple approach, providing a specialized 'scalar'
> function: ARRAY_SUM(array_node_to_iterate_over,
> expression_to_evaluate_on_each_iterated_value) function:
>
> select
>    mediawiki.page.title as pageTitle,
>    mediawiki.page.id as pageId,
>    ARRAY_SUM(mediawiki.page.revision, bytes) as totalChanges
>  where rev.timestamp between ? and ?
>
>
>>
>> I also understood drill was more of an analytical platform. So my
>> understanding is that we want to access hierarchical data, but we do not
>> want to generate any. Besides trying to run reports, charts or tables
>> (typical client applications) on hierarchical data is a mess, as the
>> toolset simply doesn't support it. Out of this reason, I would focus on
>> generating flat result for the time being.
>>
>>
> I think this is a really great point. It made me question some of the
> assumptions I had been operating on.  That being said, I'd like to hold off
> on trimming that tree entirely for the time being.  I'm concerned doing so
> would substantially reduce the effectiveness of ever using nested datasets with it.
> For example, if I do select * from a nested dataset, I really want to see
> hierarchical data returned.  In the case of building up a single query on a
> number of sub queries, I can see many useful situations where the
> intermediate queries still maintain hierarchical datasets, even if the
> final goal output might be a flat data structure for analytical tool use.
>
>
>
>> If desired I can start writing an ANTLR grammar on the stuff I am working
>> on, to make the output more robust. I had a look at the SQL parser you guys
>> mentioned, but I don't think this would work on my kind of queries, as it
>> drastically expands SQL 2003. All we want to do is to map the AST to your
>> logic plan? I think this can be done quite easily just using ANTLR and some
>> Java classes.
>>
>
> If you want to build a simple query language that generates logical plans,
> that would be interesting.  Given my rewrites, are you still skeptical of
> minimally extending SQL 2003?
>
> Jacques
>
>
>>
>> Stefan
>>
>> On 20.01.2013, at 00:56, Jacques Nadeau <ja...@gmail.com> wrote:
>>
>>> Many of these haven't been finalized since we're still working on code.
>>> That being said, let me share what my thoughts have been to date.
>>>
>>>> SQL Row maps to a drill record?
>>> Correct
>>>
>>>> And drill would not have a flat sibling structure of nodes, a.k.a.
>> columns
>>> but hierarchical nodes?
>>> Correct.  My general thinking is that a record is a DataValue.
>>> A DataValue can be one of three major types: a map (string:DataValue), an
>>> ordered list (DataValues[]), or a scalar DataValue.  Most commonly, the
>>> first DataValue in a record would be a map.  In the case of SQL/flat data
>>> (e.g. CSV), this map would only contain scalar values.
>>>
>>>> Will drill access the contents of a record in a stream or document
>> manner?
>>> How large may a record be?
>>> For the first version of Drill, I was thinking that a record must fit
>>> entirely in memory.  Functions can interact with an entire record as they
>>> choose.
>>>
>>>> Can I use XPath-like functions to access nodes?
>>> Generally, we hope so.  'Like' being the operative word here.  The path
>>> expressions that we're thinking of using are substantially simpler than
>> the
>>> expressiveness of xpath.  Ultimately, I could see people creating a
>> parser
>>> which takes in xquerys and converts them to Drill logical plans.  That
>>> being said, our goal is more for analytical queries than document
>>> transformations.
>>>
>>>> All of the google bigquery Cook Book Examples seem to generate flat
>>> Output, is this a limitation?
>>> In Drill, we don't plan to limit to flat output.  For v1, we're looking
>> at
>>> supporting hierarchical expressions in sql 'as' aliases.  We're also
>>> looking at supporting selections at any level of hierarchy, not just the
>>> leaf level.  We then combine these with a concept of collision behavior
>>> control so that you can control how to merge multiple nested out values
>>> into a single output tree.  These will allow one to build a nested output
>>> object.  These are preliminary thoughts.  We need to write more and
>> discuss
>>> more.
>>>
>>> One thing to remember is that one of Drill's goals is to be flexible.
>>> Ultimately, different query languages may support different subsets of
>>> operations and no one query language may include all operators.
>>>
>>> Hope that makes sense.
>>>
>>> Jacques
>>>
>>> On Sat, Jan 19, 2013 at 3:11 PM, Siprell, Stefan
>>> <st...@exxeta.de> wrote:
>>>
>>>> Aaaah studying the Big query docs helped. I may assume, that a SQL Row
>>>> maps to a drill record? And drill would not have a flat sibling
>> structure
>>>> of nodes, a.k.a. columns but hierarchical nodes?   All of the google
>>>> bigquery Cook Book Examples seem to generate flat Output, is this a
>>>> limitation? If not how would i generate my hierarchical Output Model,
>>>> without using a groovy builder or xquery :-)
>>>>
>>>>
>>>> Stefan
>>>>
>>>> Von meinem iPad gesendet
>>>>
>>>> Am 20.01.2013 um 00:01 schrieb "Jacques Nadeau" <
>> jacques.drill@gmail.com>:
>>>>
>>>>> Fair enough.  Starting with big query syntax or SQL 2003 and flat data
>>>>> structures will work fine.  I'll try to write something meaningful up
>>>> about
>>>>> sql and nested data structures.
>>>>>
>>>>> Jacques
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jan 19, 2013 at 2:54 PM, Siprell, Stefan
>>>>> <st...@exxeta.de> wrote:
>>>>>
>>>>>> Should I not just use this here as a reference?
>>>>>>
>>>>>> https://developers.google.com/bigquery/docs/query-reference
>>>>>>
>>>>>> I am a bit stumped to be honest. I am trying to think how to use SQL
>>>>>> efficiently on Nested Data structures.
>>>>>>
>>>>>> Von meinem iPad gesendet
>>>>>>
>>>>>> Am 19.01.2013 um 19:51 schrieb "Jacques Nadeau" <
>>>> jacques.drill@gmail.com
>>>>>> <ma...@gmail.com>>:
>>>>>>
>>>>>>
>>>>>>
>>>>>> * I drew a UML diagram. I saw that there is some glifffy support in
>>>>>> confluenc,e but the free account is pretty much useless. I used omni
>>>>>> graffle to draw the diagram, but this is payware on the mac - is there
>>>> some
>>>>>> usable freeware alternative? Don't mention tigris :-)
>>>>>>
>>>>>>
>>>>>> I don't have any suggestions on this.
>>>>>>
>>>>>>
>>>>>> * I have some ideas on the queries, but I am not sure how I should
>>>> specify
>>>>>> them? Should I use pseudo SQL? Prose? I saw the syntax document on the
>>>>>> server, it it mature enough, that I attempt to use its syntax? Is
>> there
>>>> a
>>>>>> BNF or better ANTLR grammar I can use to check my syntax? Should I
>> draw
>>>> one
>>>>>> up while I am at it?
>>>>>>
>>>>>>
>>>>>> I suggest you target SQL2003 (including subqueries).  We're looking at
>>>> how
>>>>>> to use Optiq's SQL parser for Drill.  Our goal is to stay as close as
>>>>>> possible to that spec but add the following extensions:
>>>>>> - Add flatten operator similar to BigQuery syntax
>>>>>> - Support use of selection and output identifiers using
>> dotted/bracketed
>>>>>> notation.  E.g. "select person.children[0].age as
>>>>>> output.profile.firstChildAge"
>>>>>> - Support new functions that can accept nested values including
>>>> collections
>>>>>> and maps.  For example "select ARRAY_LENGTH(person.children)".
>>>>>>
>>>>>> Once you have some sql examples, the next goal would be to manually
>>>>>> translate those into Logical Plan syntax.  This syntax is still
>>>> maturing so
>>>>>> I'd take it to the SQL stage first.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Stefan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 19.01.2013, at 02:05, Jacques Nadeau <jacques.drill@gmail.com
>>>> <mailto:
>>>>>> jacques.drill@gmail.com>> wrote:
>>>>>>
>>>>>> The wiki is up.  Michael and Stefan, it would be great if you started
>>>>>> putting your use case thoughts there.
>>>>>>
>>>>>> Jacques
>>>>>>
>>>>>> On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <ted.dunning@gmail.com
>>>>>> <ma...@gmail.com>>
>>>>>> wrote:
>>>>>>
>>>>>> Ahh... yes.  That wiki.  I will ping infra again.
>>>>>>
>>>>>> (I was attaching your comment to the wikipedia use case and had
>> confused
>>>>>> myself)
>>>>>>
>>>>>> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
>>>>>> michael.hausenblas@gmail.com<ma...@gmail.com>>
>>>> wrote:
>>>>>>
>>>>>>
>>>>>> What do you need from me?
>>>>>>
>>>>>> Maybe I've overlooked something in which case I apologize - was
>>>>>> wondering
>>>>>> if the public Wiki for Drill is available where Stefan, I and others
>>>>>> can
>>>>>> write up the UC and queries.
>>>>>>
>>>>>> Cheers,
>>>>>>            Michael
>>>>>>
>>>>>> --
>>>>>> Michael Hausenblas
>>>>>> Ireland, Europe
>>>>>> http://mhausenblas.info/
>>>>>>
>>>>>> On 13 Jan 2013, at 14:20, Ted Dunning <ted.dunning@gmail.com<mailto:
>>>>>> ted.dunning@gmail.com>> wrote:
>>>>>>
>>>>>> What do you need from me?
>>>>>>
>>>>>>
>>>>>> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
>>>>>> michael.hausenblas@gmail.com<ma...@gmail.com>>
>>>> wrote:
>>>>>>
>>>>>> as soon as we hear back from Ted re the Wiki we work there.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>
>>

Re: Introduction

Posted by Jacques Nadeau <ja...@gmail.com>.

I spent a little time looking at your first query.  I think, for all the
queries, writing a little more description of the query goals would be
helpful to ensure that I'm not misinterpreting your objective.

select rev.::parent.title, rev.::parent.id, sum(rev.text.bytes)
from mediawiki.page.revision as rev
where rev.timestamp.between(?, ?)
group by rev.::parent;

If I were trying to make it more SQL'y, I'd probably go with something like:

select
  pageTitle,
  pageId,
  sum(rev.bytes) as totalChanges
from (
  select
    mediawiki.page.title as pageTitle,
    mediawiki.page.id as pageId,
    flatten(mediawiki.page.revision) as rev
  from mediawiki
  where rev.timestamp between ? and ?
) as recordPerRevision
group by pageTitle, pageId
order by totalChanges desc
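
To make the intended semantics of the flatten-based query concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the sample records, the field layout and the helper names (flatten_revisions, total_changes) are hypothetical and are not part of Drill or of the proposed SQL syntax.

```python
from collections import defaultdict

# Hypothetical nested records, loosely modelled on the mediawiki example.
pages = [
    {"page": {"id": 1, "title": "Apache Drill", "revision": [
        {"timestamp": 10, "bytes": 120},
        {"timestamp": 20, "bytes": 30},
    ]}},
    {"page": {"id": 2, "title": "Dremel", "revision": [
        {"timestamp": 15, "bytes": 500},
        {"timestamp": 99, "bytes": 7},
    ]}},
]


def flatten_revisions(records):
    """Emit one flat row per revision, repeating the parent page fields."""
    for record in records:
        page = record["page"]
        for rev in page["revision"]:
            yield {"pageTitle": page["title"], "pageId": page["id"], "rev": rev}


def total_changes(records, start, end):
    """Filter by timestamp, group by page, sum bytes, order by total descending."""
    totals = defaultdict(int)
    for row in flatten_revisions(records):
        if start <= row["rev"]["timestamp"] <= end:
            totals[(row["pageTitle"], row["pageId"])] += row["rev"]["bytes"]
    return sorted(
        ((title, page_id, total) for (title, page_id), total in totals.items()),
        key=lambda item: item[2],
        reverse=True,
    )


result = total_changes(pages, 0, 50)
```

The inner generator plays the role of the subquery (one record per revision), and the outer function plays the role of the group by / order by.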

A cleaner alternative would be providing the more complicated WITHIN syntax
as BigQuery does:

 select
    mediawiki.page.title as pageTitle,
    mediawiki.page.id as pageId,
    sum(mediawiki.page.revision.bytes) within RECORD as totalChanges
  from mediawiki
  where rev.timestamp between ? and ?


Or extending the SQL2003 windowing functions such that partitioning
within a single record is possible, so that the aggregating functions can
use relative references.

 select
    mediawiki.page.title as pageTitle,
    mediawiki.page.id as pageId,
    sum(bytes) OVER(PARTITION BY mediawiki.page.revision) as totalChanges
  where rev.timestamp between ? and ?

Or, as the simplest approach, providing a specialized 'scalar' function,
ARRAY_SUM(array_node_to_iterate_over,
expression_to_evaluate_on_each_iterated_value):

 select
    mediawiki.page.title as pageTitle,
    mediawiki.page.id as pageId,
    ARRAY_SUM(mediawiki.page.revision, bytes) as totalChanges
  where rev.timestamp between ? and ?
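
The ARRAY_SUM idea can be sketched in a few lines of Python. This is a hedged illustration only: the function below is hypothetical, and it models the "expression to evaluate on each iterated value" as a Python callable, which is an assumption about how such an expression might be bound.

```python
def array_sum(array_node, expression):
    """Sum expression(item) over every item of array_node.

    A missing (None) or empty array contributes 0, so the result stays scalar.
    """
    return sum(expression(item) for item in (array_node or []))


# Hypothetical page record with a nested list of revisions.
page = {"title": "Apache Drill", "revision": [{"bytes": 120}, {"bytes": 30}]}
total = array_sum(page["revision"], lambda rev: rev["bytes"])
```

Because the aggregation happens inside a single record, no flatten or group by step is needed.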


>
> I also understood Drill to be more of an analytical platform, so my
> understanding is that we want to access hierarchical data, but we do not
> want to generate any. Besides, trying to run reports, charts or tables
> (typical client applications) on hierarchical data is a mess, as the
> toolset simply doesn't support it. For this reason, I would focus on
> generating flat results for the time being.
>
>
I think this is a really great point. It made me question some of the
assumptions I had been operating on.  That being said, I'd like to hold off
on trimming that tree entirely for the time being.  I'm concerned doing so
would substantially reduce the effectiveness of ever using nested datasets
with it.  For example, if I do select * from a nested dataset, I really want
to see hierarchical data returned.  In the case of building up a single query
on a number of sub queries, I can see many useful situations where the
intermediate queries still maintain hierarchical datasets, even if the
final goal output might be a flat data structure for analytical tool use.



> If desired, I can start writing an ANTLR grammar for the stuff I am working
> on, to make the output more robust. I had a look at the SQL parser you guys
> mentioned, but I don't think it would work for my kind of queries, as they
> drastically expand SQL 2003. All we want to do is map the AST to your
> logical plan? I think this can be done quite easily using just ANTLR and some
> Java classes.
>

If you want to build a simple query language that generates logical plans,
that would be interesting.  Given my rewrites, are you still skeptical of
minimally extending SQL 2003?

Jacques


>
> Stefan
>
> On 20.01.2013, at 00:56, Jacques Nadeau <ja...@gmail.com> wrote:
>
> > Many of these haven't been finalized since we're still working on code.
> > That being said, let me share what my thoughts have been to date.
> >
> >> SQL Row maps to a drill record?
> > Correct
> >
> >> And drill would not have a flat sibling structure of nodes, a.k.a.
> columns
> > but hierarchical nodes?
> > Correct.  My general thinking is that a record is a DataValue.
> > A DataValue can be one of three major types: a map (string:DataValue), an
> > ordered list (DataValues[]), or a scalar DataValue.  Most commonly, the
> > first DataValue in a record would be a map.  In the case of SQL/flat data
> > (e.g. CSV), this map would only contain scalar values.
> >
> >> Will drill access the contents of a record in a stream or document
> manner?
> > How large may i record be?
> > For the first version of Drill, I was thinking that a record must fit
> > entirely in memory.  Functions can interact with an entire record as they
> > choose.
> >
> >> Can i use Xpath like functions to acces nodes?
> > Generally, we hope so.  'Like' being the operative word here.  The path
> > expressions that we're thinking of using are substantially simpler than
> the
> > expressiveness of xpath.  Ultimately, I could see people creating a
> parser
> > which takes in xquerys and converts them to Drill logical plans.  That
> > being said, our goal is more for analytical queries than document
> > transformations.
> >
> >> All of the google bigquery Cook Book Examples seem to generate flat
> > Output, is this a limitation?
> > In Drill, we don't plan to limit to flat output.  For v1, we're looking
> at
> > supporting hierarchical expressions in sql 'as' aliases.  We're also
> > looking at supporting selections at any level of hierarchy, not just the
> > leaf level.  We then combine these with a concept of collision behavior
> > control so that you can control how to merge multiple nested out values
> > into a single output tree.  These will allow one to build a nested output
> > object.  These are preliminary thoughts.  We need to write more and
> discuss
> > more.
> >
> > One thing to remember is that one of Drill's goals is to be flexible.
> > Ultimately, different query languages may support different subsets of
> > operations and no one query language may include all operators.
> >
> > Hope that makes sense.
> >
> > Jacques
> >
> > On Sat, Jan 19, 2013 at 3:11 PM, Siprell, Stefan
> > <st...@exxeta.de>wrote:
> >
> >> Aaaah studying the Big query docs helped. I may assume, that a SQL Row
> >> maps to a drill record? And drill would not have a flat sibling
> structure
> >> of nodes, a.k.a. columns but hierarchical nodes?   All of the google
> >> bigquery Cook Book Examples seem to generate flat Output, is this a
> >> limitation? If not how would i generate my hierarchical Output Model,
> >> without using a groovy builder or xquery :-)
> >>
> >>
> >> Stefan
> >>
> >> Sent from my iPad
> >>
> >> On 20.01.2013, at 00:01, "Jacques Nadeau" <
> jacques.drill@gmail.com> wrote:
> >>
> >>> Fair enough.  Starting with big query syntax or SQL 2003 and flat data
> >>> structures will work fine.  I'll try to write something meaningful up
> >> about
> >>> sql and nested data structures.
> >>>
> >>> Jacques
> >>>
> >>>
> >>>
> >>> On Sat, Jan 19, 2013 at 2:54 PM, Siprell, Stefan
> >>> <st...@exxeta.de>wrote:
> >>>
> >>>> Should I not just use this here as a reference?
> >>>>
> >>>> https://developers.google.com/bigquery/docs/query-reference
> >>>>
> >>>> I am a bit stumped to be honest. I am trying to think how to use SQL
> >>>> efficiently on nested data structures.
> >>>>
> >>>> Sent from my iPad
> >>>>
> >>>> On 19.01.2013, at 19:51, "Jacques Nadeau" <
> >> jacques.drill@gmail.com
> >>>> <ma...@gmail.com>> wrote:
> >>>>
> >>>>
> >>>>
> >>>> * I drew a UML diagram. I saw that there is some glifffy support in
> >>>> confluenc,e but the free account is pretty much useless. I used omni
> >>>> graffle to draw the diagram, but this is payware on the mac - is there
> >> some
> >>>> usable freeware alternative? Don't mention tigris :-)
> >>>>
> >>>>
> >>>> I don't have any suggestions on this.
> >>>>
> >>>>
> >>>> * I have some ideas on the queries, but I am not sure how I should
> >> specify
> >>>> them? Should I use pseudo SQL? Prose? I saw the syntax document on the
> >>>> server, it it mature enough, that I attempt to use its syntax? Is
> there
> >> a
> >>>> BNF or better ANTLR grammar I can use to check my syntax? Should I
> draw
> >> one
> >>>> up while I am at it?
> >>>>
> >>>>
> >>>> I suggest you target SQL2003 (including subqueries).  We're looking at
> >> how
> >>>> to use Optiq's SQL parser for Drill.  Our goal is to stay as close as
> >>>> possible to that spec but add the following extensions:
> >>>> - Add flatten operator similar to BigQuery syntax
> >>>> - Support use of selection and output identifiers using
> dotted/bracketed
> >>>> notation.  E.g. "select person.children[0].age as
> >>>> output.profile.firstChildAge"
> >>>> - Support new functions that can accept nested values including
> >> collections
> >>>> and maps.  For example "select ARRAY_LENGTH(person.children)".
> >>>>
> >>>> Once you have some sql examples, the next goal would be to manually
> >>>> translate those into Logical Plan syntax.  This syntax is still
> >> maturing so
> >>>> I'd take it to the SQL stage first.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Stefan
> >>>>
> >>>>
> >>>>
> >>>> On 19.01.2013, at 02:05, Jacques Nadeau <jacques.drill@gmail.com
> >> <mailto:
> >>>> jacques.drill@gmail.com>> wrote:
> >>>>
> >>>> The wiki is up.  Michael and Stefan, it would be great if you started
> >>>> putting your use case thoughts there.
> >>>>
> >>>> Jacques
> >>>>
> >>>> On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <ted.dunning@gmail.com
> >>>> <ma...@gmail.com>>
> >>>> wrote:
> >>>>
> >>>> Ahh... yes.  That wiki.  I will ping infra again.
> >>>>
> >>>> (I was attaching your comment to the wikipedia use case and had
> confused
> >>>> myself)
> >>>>
> >>>> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
> >>>> michael.hausenblas@gmail.com<ma...@gmail.com>>
> >> wrote:
> >>>>
> >>>>
> >>>> What do you need from me?
> >>>>
> >>>> Maybe I've overlooked something in which case I apologize - was
> >>>> wondering
> >>>> if the public Wiki for Drill is available where Stefan, I and others
> >>>> can
> >>>> write up the UC and queries.
> >>>>
> >>>> Cheers,
> >>>>             Michael
> >>>>
> >>>> --
> >>>> Michael Hausenblas
> >>>> Ireland, Europe
> >>>> http://mhausenblas.info/
> >>>>
> >>>> On 13 Jan 2013, at 14:20, Ted Dunning <ted.dunning@gmail.com<mailto:
> >>>> ted.dunning@gmail.com>> wrote:
> >>>>
> >>>> What do you need from me?
> >>>>
> >>>>
> >>>> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
> >>>> michael.hausenblas@gmail.com<ma...@gmail.com>>
> >> wrote:
> >>>>
> >>>> as soon as we hear back from Ted re the Wiki we work there.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>
>
>

Re: Introduction

Posted by "Siprell, Stefan" <st...@exxeta.de>.

Good morning Jacques,

I have now added some queries, using your great feedback. I got a little creative with SQL extensions for DataValues and documented this inline with my queries. I stumbled on a question regarding indexes and DataValues: will the index point to a record, or will it point to a subrecord element? I wrote this down with my query examples, but it seems to be a more general question, so I thought I should repeat it on the dev mailing list. I started drafting my queries using like expressions but found this unnatural, so I moved towards inlining the hierarchical elements into the statement itself.

I also understood Drill to be more of an analytical platform, so my understanding is that we want to access hierarchical data, but we do not want to generate any. Besides, trying to run reports, charts or tables (typical client applications) on hierarchical data is a mess, as the toolset simply doesn't support it. For this reason, I would focus on generating flat results for the time being.

If desired, I can start writing an ANTLR grammar for the stuff I am working on, to make the output more robust. I had a look at the SQL parser you guys mentioned, but I don't think it would work for my kind of queries, as they drastically expand SQL 2003. All we want to do is map the AST to your logical plan? I think this can be done quite easily using just ANTLR and some Java classes.

Stefan

On 20.01.2013, at 00:56, Jacques Nadeau <ja...@gmail.com> wrote:

> Many of these haven't been finalized since we're still working on code.
> That being said, let me share what my thoughts have been to date.
> 
>> SQL Row maps to a drill record?
> Correct
> 
>> And drill would not have a flat sibling structure of nodes, a.k.a. columns
> but hierarchical nodes?
> Correct.  My general thinking is that a record is a DataValue.
> A DataValue can be one of three major types: a map (string:DataValue), an
> ordered list (DataValues[]), or a scalar DataValue.  Most commonly, the
> first DataValue in a record would be a map.  In the case of SQL/flat data
> (e.g. CSV), this map would only contain scalar values.
> 
>> Will Drill access the contents of a record in a stream or document manner?
> How large may a record be?
> For the first version of Drill, I was thinking that a record must fit
> entirely in memory.  Functions can interact with an entire record as they
> choose.
> 
>> Can I use XPath-like functions to access nodes?
> Generally, we hope so.  'Like' being the operative word here.  The path
> expressions that we're thinking of using are substantially less expressive
> than XPath.  Ultimately, I could see people creating a parser
> which takes in XQuery queries and converts them to Drill logical plans.  That
> being said, our goal is more for analytical queries than document
> transformations.
> 
>> All of the google bigquery Cook Book Examples seem to generate flat
> Output, is this a limitation?
> In Drill, we don't plan to limit to flat output.  For v1, we're looking at
> supporting hierarchical expressions in sql 'as' aliases.  We're also
> looking at supporting selections at any level of hierarchy, not just the
> leaf level.  We then combine these with a concept of collision behavior
> control so that you can control how to merge multiple nested output values
> into a single output tree.  These will allow one to build a nested output
> object.  These are preliminary thoughts.  We need to write more and discuss
> more.
> 
> One thing to remember is that one of Drill's goals is to be flexible.
> Ultimately, different query languages may support different subsets of
> operations and no one query language may include all operators.
> 
> Hope that makes sense.
> 
> Jacques
> 
> On Sat, Jan 19, 2013 at 3:11 PM, Siprell, Stefan
> <st...@exxeta.de>wrote:
> 
>> Aaaah studying the Big query docs helped. I may assume, that a SQL Row
>> maps to a drill record? And drill would not have a flat sibling structure
>> of nodes, a.k.a. columns but hierarchical nodes?   All of the google
>> bigquery Cook Book Examples seem to generate flat Output, is this a
>> limitation? If not how would i generate my hierarchical Output Model,
>> without using a groovy builder or xquery :-)
>> 
>> 
>> Stefan
>> 
>> Sent from my iPad
>> 
>> On 20.01.2013, at 00:01, "Jacques Nadeau" <ja...@gmail.com> wrote:
>> 
>>> Fair enough.  Starting with big query syntax or SQL 2003 and flat data
>>> structures will work fine.  I'll try to write something meaningful up
>> about
>>> sql and nested data structures.
>>> 
>>> Jacques
>>> 
>>> 
>>> 
>>> On Sat, Jan 19, 2013 at 2:54 PM, Siprell, Stefan
>>> <st...@exxeta.de>wrote:
>>> 
>>>> Should I not just use this here as a reference?
>>>> 
>>>> https://developers.google.com/bigquery/docs/query-reference
>>>> 
>>>> I am a bit stumped to be honest. I am trying to think how to use SQL
>>>> efficiently on nested data structures.
>>>> 
>>>> Sent from my iPad
>>>> 
>>>> On 19.01.2013, at 19:51, "Jacques Nadeau" <
>> jacques.drill@gmail.com
>>>> <ma...@gmail.com>> wrote:
>>>> 
>>>> 
>>>> 
>>>> * I drew a UML diagram. I saw that there is some glifffy support in
>>>> confluenc,e but the free account is pretty much useless. I used omni
>>>> graffle to draw the diagram, but this is payware on the mac - is there
>> some
>>>> usable freeware alternative? Don't mention tigris :-)
>>>> 
>>>> 
>>>> I don't have any suggestions on this.
>>>> 
>>>> 
>>>> * I have some ideas on the queries, but I am not sure how I should
>> specify
>>>> them? Should I use pseudo SQL? Prose? I saw the syntax document on the
>>>> server, it it mature enough, that I attempt to use its syntax? Is there
>> a
>>>> BNF or better ANTLR grammar I can use to check my syntax? Should I draw
>> one
>>>> up while I am at it?
>>>> 
>>>> 
>>>> I suggest you target SQL2003 (including subqueries).  We're looking at
>> how
>>>> to use Optiq's SQL parser for Drill.  Our goal is to stay as close as
>>>> possible to that spec but add the following extensions:
>>>> - Add flatten operator similar to BigQuery syntax
>>>> - Support use of selection and output identifiers using dotted/bracketed
>>>> notation.  E.g. "select person.children[0].age as
>>>> output.profile.firstChildAge"
>>>> - Support new functions that can accept nested values including
>> collections
>>>> and maps.  For example "select ARRAY_LENGTH(person.children)".
>>>> 
>>>> Once you have some sql examples, the next goal would be to manually
>>>> translate those into Logical Plan syntax.  This syntax is still
>> maturing so
>>>> I'd take it to the SQL stage first.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Stefan
>>>> 
>>>> 
>>>> 
>>>> On 19.01.2013, at 02:05, Jacques Nadeau <jacques.drill@gmail.com
>> <mailto:
>>>> jacques.drill@gmail.com>> wrote:
>>>> 
>>>> The wiki is up.  Michael and Stefan, it would be great if you started
>>>> putting your use case thoughts there.
>>>> 
>>>> Jacques
>>>> 
>>>> On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <ted.dunning@gmail.com
>>>> <ma...@gmail.com>>
>>>> wrote:
>>>> 
>>>> Ahh... yes.  That wiki.  I will ping infra again.
>>>> 
>>>> (I was attaching your comment to the wikipedia use case and had confused
>>>> myself)
>>>> 
>>>> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
>>>> michael.hausenblas@gmail.com<ma...@gmail.com>>
>> wrote:
>>>> 
>>>> 
>>>> What do you need from me?
>>>> 
>>>> Maybe I've overlooked something in which case I apologize - was
>>>> wondering
>>>> if the public Wiki for Drill is available where Stefan, I and others
>>>> can
>>>> write up the UC and queries.
>>>> 
>>>> Cheers,
>>>>             Michael
>>>> 
>>>> --
>>>> Michael Hausenblas
>>>> Ireland, Europe
>>>> http://mhausenblas.info/
>>>> 
>>>> On 13 Jan 2013, at 14:20, Ted Dunning <ted.dunning@gmail.com<mailto:
>>>> ted.dunning@gmail.com>> wrote:
>>>> 
>>>> What do you need from me?
>>>> 
>>>> 
>>>> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
>>>> michael.hausenblas@gmail.com<ma...@gmail.com>>
>> wrote:
>>>> 
>>>> as soon as we hear back from Ted re the Wiki we work there.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>

Re: Introduction

Posted by Jacques Nadeau <ja...@gmail.com>.

Many of these haven't been finalized since we're still working on code.
 That being said, let me share what my thoughts have been to date.

>SQL Row maps to a drill record?
Correct

>And drill would not have a flat sibling structure of nodes, a.k.a. columns
but hierarchical nodes?
Correct.  My general thinking is that a record is a DataValue.
 A DataValue can be one of three major types: a map (string:DataValue), an
ordered list (DataValues[]), or a scalar DataValue.  Most commonly, the
first DataValue in a record would be a map.  In the case of SQL/flat data
(e.g. CSV), this map would only contain scalar values.
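
As a rough illustration of the DataValue idea described above, the following Python sketch models the three kinds with plain Python values. This is an assumption-laden sketch, not Drill's actual data model: the function names and the exact set of scalar types (None, bool, int, float, str) are hypothetical.

```python
def classify(value):
    """Return 'map', 'list' or 'scalar' for a DataValue, validating it recursively."""
    if isinstance(value, dict):
        for key, child in value.items():
            if not isinstance(key, str):
                raise TypeError("map keys must be strings")
            classify(child)
        return "map"
    if isinstance(value, list):
        for child in value:
            classify(child)
        return "list"
    if value is None or isinstance(value, (bool, int, float, str)):
        return "scalar"
    raise TypeError(f"unsupported DataValue: {type(value).__name__}")


def is_flat(record):
    """True when a record is a map holding only scalars, as with CSV or SQL rows."""
    return classify(record) == "map" and all(
        classify(child) == "scalar" for child in record.values()
    )


flat_row = {"id": 1, "title": "Apache Drill"}
nested_row = {"id": 1, "revision": [{"bytes": 120}, {"bytes": 30}]}
```

Here flat_row corresponds to the SQL/CSV case, while nested_row is a map whose values include an ordered list of further maps.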

>Will Drill access the contents of a record in a stream or document manner?
How large may a record be?
For the first version of Drill, I was thinking that a record must fit
entirely in memory.  Functions can interact with an entire record as they
choose.

>Can I use XPath-like functions to access nodes?
Generally, we hope so.  'Like' being the operative word here.  The path
expressions that we're thinking of using are substantially less expressive
than XPath.  Ultimately, I could see people creating a parser
which takes in XQuery queries and converts them to Drill logical plans.  That
being said, our goal is more for analytical queries than document
transformations.
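
A simple path expression of the dotted/bracketed kind mentioned elsewhere in this thread (for example person.children[0].age) could be evaluated along the following lines. This is a minimal sketch under assumptions: the grammar (identifiers separated by dots, non-negative integer indexes in brackets) and the helper names are hypothetical, and the parser is deliberately lenient.

```python
import re

# One step is either an identifier or a bracketed non-negative integer index.
_STEP = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def parse_path(path):
    """Split 'a.b[0].c' into the steps ['a', 'b', 0, 'c']."""
    steps = []
    pos = 0
    while pos < len(path):
        if path[pos] == ".":
            pos += 1
            continue
        match = _STEP.match(path, pos)
        if match is None:
            raise ValueError(f"cannot parse {path!r} at position {pos}")
        name, index = match.groups()
        steps.append(name if name is not None else int(index))
        pos = match.end()
    return steps


def get_path(record, path):
    """Walk a nested dict/list structure following the parsed steps."""
    current = record
    for step in parse_path(path):
        current = current[step]
    return current


record = {"person": {"children": [{"age": 7}, {"age": 4}]}}
first_child_age = get_path(record, "person.children[0].age")
child_count = len(get_path(record, "person.children"))  # in the spirit of ARRAY_LENGTH
```

Selecting at a non-leaf level, as with person.children, simply returns the nested list rather than a scalar.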

>All of the google bigquery Cook Book Examples seem to generate flat
Output, is this a limitation?
In Drill, we don't plan to limit to flat output.  For v1, we're looking at
supporting hierarchical expressions in sql 'as' aliases.  We're also
looking at supporting selections at any level of hierarchy, not just the
leaf level.  We then combine these with a concept of collision behavior
control so that you can control how to merge multiple nested output values
into a single output tree.  These will allow one to build a nested output
object.  These are preliminary thoughts.  We need to write more and discuss
more.
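
One possible reading of hierarchical 'as' aliases with collision control is sketched below in Python. Everything here is an assumption for illustration: the policy names ("error", "keep", "overwrite"), the function name set_path and the restriction to dotted aliases without brackets are hypothetical and not taken from any Drill design.

```python
COLLISION_MODES = {"error", "keep", "overwrite"}  # hypothetical policy names


def set_path(output, alias, value, on_collision="error"):
    """Store value in the nested dict `output` at the dotted alias."""
    if on_collision not in COLLISION_MODES:
        raise ValueError(f"unknown collision mode: {on_collision!r}")
    *parents, leaf = alias.split(".")
    node = output
    for name in parents:
        child = node.setdefault(name, {})
        if not isinstance(child, dict):
            raise ValueError(f"{name!r} already holds a non-map value")
        node = child
    if leaf in node:
        if on_collision == "error":
            raise ValueError(f"alias {alias!r} collides with an existing value")
        if on_collision == "keep":
            return output  # first writer wins
    node[leaf] = value  # new leaf, or "overwrite": last writer wins
    return output


row = {}
set_path(row, "output.profile.firstChildAge", 7)
set_path(row, "output.profile.name", "Ann")
set_path(row, "output.profile.name", "Bob", on_collision="keep")
```

Each select expression contributes one aliased value, and the collision policy decides what happens when two expressions target the same place in the output tree.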

One thing to remember is that one of Drill's goals is to be flexible.
 Ultimately, different query languages may support different subsets of
operations and no one query language may include all operators.

Hope that makes sense.

Jacques

On Sat, Jan 19, 2013 at 3:11 PM, Siprell, Stefan
<st...@exxeta.de>wrote:

> Aaaah, studying the BigQuery docs helped. May I assume that a SQL row
> maps to a Drill record? And Drill would not have a flat sibling structure
> of nodes, a.k.a. columns, but hierarchical nodes?   All of the Google
> BigQuery Cookbook examples seem to generate flat output; is this a
> limitation? If not, how would I generate my hierarchical output model
> without using a Groovy builder or XQuery :-)
>
>
> Stefan
>
> Sent from my iPad
>
> On 20.01.2013, at 00:01, "Jacques Nadeau" <ja...@gmail.com> wrote:
>
> > Fair enough.  Starting with big query syntax or SQL 2003 and flat data
> > structures will work fine.  I'll try to write something meaningful up
> about
> > sql and nested data structures.
> >
> > Jacques
> >
> >
> >
> > On Sat, Jan 19, 2013 at 2:54 PM, Siprell, Stefan
> > <st...@exxeta.de>wrote:
> >
> >> Should I not just use this here as a reference?
> >>
> >> https://developers.google.com/bigquery/docs/query-reference
> >>
> >> I am a bit stumped to be honest. I am trying to think how to use SQL
> >> efficiently on nested data structures.
> >>
> >> Sent from my iPad
> >>
> >> On 19.01.2013, at 19:51, "Jacques Nadeau" <
> jacques.drill@gmail.com
> >> <ma...@gmail.com>> wrote:
> >>
> >>
> >>
> >> * I drew a UML diagram. I saw that there is some glifffy support in
> >> confluenc,e but the free account is pretty much useless. I used omni
> >> graffle to draw the diagram, but this is payware on the mac - is there
> some
> >> usable freeware alternative? Don't mention tigris :-)
> >>
> >>
> >> I don't have any suggestions on this.
> >>
> >>
> >> * I have some ideas on the queries, but I am not sure how I should
> specify
> >> them? Should I use pseudo SQL? Prose? I saw the syntax document on the
> >> server, it it mature enough, that I attempt to use its syntax? Is there
> a
> >> BNF or better ANTLR grammar I can use to check my syntax? Should I draw
> one
> >> up while I am at it?
> >>
> >>
> >> I suggest you target SQL2003 (including subqueries).  We're looking at
> how
> >> to use Optiq's SQL parser for Drill.  Our goal is to stay as close as
> >> possible to that spec but add the following extensions:
> >> - Add flatten operator similar to BigQuery syntax
> >> - Support use of selection and output identifiers using dotted/bracketed
> >> notation.  E.g. "select person.children[0].age as
> >> output.profile.firstChildAge"
> >> - Support new functions that can accept nested values including
> collections
> >> and maps.  For example "select ARRAY_LENGTH(person.children)".
> >>
> >> Once you have some sql examples, the next goal would be to manually
> >> translate those into Logical Plan syntax.  This syntax is still
> maturing so
> >> I'd take it to the SQL stage first.
> >>
> >>
> >>
> >>
> >>
> >>
> >> Stefan
> >>
> >>
> >>
> >> On 19.01.2013, at 02:05, Jacques Nadeau <jacques.drill@gmail.com
> <mailto:
> >> jacques.drill@gmail.com>> wrote:
> >>
> >> The wiki is up.  Michael and Stefan, it would be great if you started
> >> putting your use case thoughts there.
> >>
> >> Jacques
> >>
> >> On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <ted.dunning@gmail.com
> >> <ma...@gmail.com>>
> >> wrote:
> >>
> >> Ahh... yes.  That wiki.  I will ping infra again.
> >>
> >> (I was attaching your comment to the wikipedia use case and had confused
> >> myself)
> >>
> >> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
> >> michael.hausenblas@gmail.com<ma...@gmail.com>>
> wrote:
> >>
> >>
> >> What do you need from me?
> >>
> >> Maybe I've overlooked something in which case I apologize - was
> >> wondering
> >> if the public Wiki for Drill is available where Stefan, I and others
> >> can
> >> write up the UC and queries.
> >>
> >> Cheers,
> >>              Michael
> >>
> >> --
> >> Michael Hausenblas
> >> Ireland, Europe
> >> http://mhausenblas.info/
> >>
> >> On 13 Jan 2013, at 14:20, Ted Dunning <ted.dunning@gmail.com<mailto:
> >> ted.dunning@gmail.com>> wrote:
> >>
> >> What do you need from me?
> >>
> >>
> >> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
> >> michael.hausenblas@gmail.com<ma...@gmail.com>>
> wrote:
> >>
> >> as soon as we hear back from Ted re the Wiki we work there.
> >>
> >>
> >>
> >>
> >>
> >>
>

Re: Introduction

Posted by "Siprell, Stefan" <st...@exxeta.de>.

Aaaah, studying the BigQuery docs helped. May I assume that a SQL row maps to a Drill record? And Drill would not have a flat sibling structure of nodes, a.k.a. columns, but hierarchical nodes? Will Drill access the contents of a record in a stream or document manner? How large may a record be? Can I use XPath-like functions to access nodes? All of the Google BigQuery Cookbook examples seem to generate flat output; is this a limitation? If not, how would I generate my hierarchical output model without using a Groovy builder or XQuery :-)


Stefan

Sent from my iPad

On 20.01.2013, at 00:01, "Jacques Nadeau" <ja...@gmail.com> wrote:

> Fair enough.  Starting with BigQuery syntax or SQL 2003 and flat data
> structures will work fine.  I'll try to write up something meaningful about
> SQL and nested data structures.
> 
> Jacques
> 
> 
> 
> On Sat, Jan 19, 2013 at 2:54 PM, Siprell, Stefan
> <st...@exxeta.de>wrote:
> 
>> Should I not just use this here as a reference?
>> 
>> https://developers.google.com/bigquery/docs/query-reference
>> 
>> I am a bit stumped to be honest. I am trying to think how to use SQL
>> efficiently on nested data structures.
>> 
>> Sent from my iPad
>> 
>> On 19.01.2013, at 19:51, "Jacques Nadeau" <jacques.drill@gmail.com
>> <ma...@gmail.com>> wrote:
>> 
>> 
>> 
>> * I drew a UML diagram. I saw that there is some glifffy support in
>> confluenc,e but the free account is pretty much useless. I used omni
>> graffle to draw the diagram, but this is payware on the mac - is there some
>> usable freeware alternative? Don't mention tigris :-)
>> 
>> 
>> I don't have any suggestions on this.
>> 
>> 
>> * I have some ideas on the queries, but I am not sure how I should specify
>> them? Should I use pseudo SQL? Prose? I saw the syntax document on the
>> server, it it mature enough, that I attempt to use its syntax? Is there a
>> BNF or better ANTLR grammar I can use to check my syntax? Should I draw one
>> up while I am at it?
>> 
>> 
>> I suggest you target SQL2003 (including subqueries).  We're looking at how
>> to use Optiq's SQL parser for Drill.  Our goal is to stay as close as
>> possible to that spec but add the following extensions:
>> - Add flatten operator similar to BigQuery syntax
>> - Support use of selection and output identifiers using dotted/bracketed
>> notation.  E.g. "select person.children[0].age as
>> output.profile.firstChildAge"
>> - Support new functions that can accept nested values including collections
>> and maps.  For example "select ARRAY_LENGTH(person.children)".
>> 
>> Once you have some sql examples, the next goal would be to manually
>> translate those into Logical Plan syntax.  This syntax is still maturing so
>> I'd take it to the SQL stage first.
>> 
>> 
>> 
>> 
>> 
>> 
>> Stefan
>> 
>> 
>> 
>> On 19.01.2013, at 02:05, Jacques Nadeau <jacques.drill@gmail.com<mailto:
>> jacques.drill@gmail.com>> wrote:
>> 
>> The wiki is up.  Michael and Stefan, it would be great if you started
>> putting your use case thoughts there.
>> 
>> Jacques
>> 
>> On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <ted.dunning@gmail.com
>> <ma...@gmail.com>>
>> wrote:
>> 
>> Ahh... yes.  That wiki.  I will ping infra again.
>> 
>> (I was attaching your comment to the wikipedia use case and had confused
>> myself)
>> 
>> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
>> michael.hausenblas@gmail.com<ma...@gmail.com>> wrote:
>> 
>> 
>> What do you need from me?
>> 
>> Maybe I've overlooked something in which case I apologize - was
>> wondering
>> if the public Wiki for Drill is available where Stefan, I and others
>> can
>> write up the UC and queries.
>> 
>> Cheers,
>>              Michael
>> 
>> --
>> Michael Hausenblas
>> Ireland, Europe
>> http://mhausenblas.info/
>> 
>> On 13 Jan 2013, at 14:20, Ted Dunning <ted.dunning@gmail.com<mailto:
>> ted.dunning@gmail.com>> wrote:
>> 
>> What do you need from me?
>> 
>> 
>> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
>> michael.hausenblas@gmail.com<ma...@gmail.com>> wrote:
>> 
>> as soon as we hear back from Ted re the Wiki we work there.
>> 
>> 
>> 
>> 
>> 
>>

Re: Introduction

Posted by Jacques Nadeau <ja...@gmail.com>.

Fair enough.  Starting with BigQuery syntax or SQL 2003 and flat data
structures will work fine.  I'll try to write up something meaningful about
SQL and nested data structures.

Jacques



On Sat, Jan 19, 2013 at 2:54 PM, Siprell, Stefan
<st...@exxeta.de>wrote:

> Should I not just use this here as a reference?
>
> https://developers.google.com/bigquery/docs/query-reference
>
> I am a bit stumped, to be honest. I am trying to think how to use SQL
> efficiently on nested data structures.
>
> Sent from my iPad
>
> On 19.01.2013 at 19:51, "Jacques Nadeau" <jacques.drill@gmail.com> wrote:
>
>
>
> * I drew a UML diagram. I saw that there is some Gliffy support in
> Confluence, but the free account is pretty much useless. I used
> OmniGraffle to draw the diagram, but this is payware on the Mac - is there some
> usable freeware alternative? Don't mention tigris :-)
>
>
> I don't have any suggestions on this.
>
>
> * I have some ideas on the queries, but I am not sure how I should specify
> them? Should I use pseudo SQL? Prose? I saw the syntax document on the
> server. Is it mature enough that I should attempt to use its syntax? Is there a
> BNF or better ANTLR grammar I can use to check my syntax? Should I draw one
> up while I am at it?
>
>
> I suggest you target SQL2003 (including subqueries).  We're looking at how
> to use Optiq's SQL parser for Drill.  Our goal is to stay as close as
> possible to that spec but add the following extensions:
> - Add flatten operator similar to BigQuery syntax
> - Support use of selection and output identifiers using dotted/bracketed
> notation.  E.g. "select person.children[0].age as
> output.profile.firstChildAge"
> - Support new functions that can accept nested values including collections
> and maps.  For example "select ARRAY_LENGTH(person.children)".
>
> Once you have some sql examples, the next goal would be to manually
> translate those into Logical Plan syntax.  This syntax is still maturing so
> I'd take it to the SQL stage first.
>
>
>
>
>
>
> Stefan
>
>
>
> On 19.01.2013, at 02:05, Jacques Nadeau <jacques.drill@gmail.com<mailto:
> jacques.drill@gmail.com>> wrote:
>
> The wiki is up.  Michael and Stefan, it would be great if you started
> putting your use case thoughts there.
>
> Jacques
>
> On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <ted.dunning@gmail.com
> <ma...@gmail.com>>
> wrote:
>
> Ahh... yes.  That wiki.  I will ping infra again.
>
> (I was attaching your comment to the wikipedia use case and had confused
> myself)
>
> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
> michael.hausenblas@gmail.com<ma...@gmail.com>> wrote:
>
>
> What do you need from me?
>
> Maybe I've overlooked something in which case I apologize - was
> wondering
> if the public Wiki for Drill is available where Stefan, I and others
> can
> write up the UC and queries.
>
> Cheers,
>               Michael
>
> --
> Michael Hausenblas
> Ireland, Europe
> http://mhausenblas.info/
>
> On 13 Jan 2013, at 14:20, Ted Dunning <ted.dunning@gmail.com<mailto:
> ted.dunning@gmail.com>> wrote:
>
> What do you need from me?
>
>
> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
> michael.hausenblas@gmail.com<ma...@gmail.com>> wrote:
>
> as soon as we hear back from Ted re the Wiki we work there.
>
>
>
>
>
>

Re: Introduction

Posted by "Siprell, Stefan" <st...@exxeta.de>.

Should I not just use this here as a reference?

https://developers.google.com/bigquery/docs/query-reference

I am a bit stumped, to be honest. I am trying to think how to use SQL efficiently on nested data structures.

Sent from my iPad

On 19.01.2013 at 19:51, "Jacques Nadeau" <ja...@gmail.com> wrote:

* I drew a UML diagram. I saw that there is some Gliffy support in
Confluence, but the free account is pretty much useless. I used
OmniGraffle to draw the diagram, but this is payware on the Mac - is there some
usable freeware alternative? Don't mention tigris :-)

I don't have any suggestions on this.

* I have some ideas on the queries, but I am not sure how I should specify
them? Should I use pseudo SQL? Prose? I saw the syntax document on the
server. Is it mature enough that I should attempt to use its syntax? Is there a
BNF or better ANTLR grammar I can use to check my syntax? Should I draw one
up while I am at it?

I suggest you target SQL2003 (including subqueries). We're looking at how
to use Optiq's SQL parser for Drill. Our goal is to stay as close as
possible to that spec but add the following extensions:
- Add flatten operator similar to BigQuery syntax
- Support use of selection and output identifiers using dotted/bracketed
notation. E.g. "select person.children[0].age as
output.profile.firstChildAge"
- Support new functions that can accept nested values including collections
and maps. For example "select ARRAY_LENGTH(person.children)".

Once you have some sql examples, the next goal would be to manually
translate those into Logical Plan syntax. This syntax is still maturing so
I'd take it to the SQL stage first.

Stefan

On 19.01.2013, at 02:05, Jacques Nadeau <ja...@gmail.com>> wrote:

The wiki is up. Michael and Stefan, it would be great if you started
putting your use case thoughts there.

Jacques

On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <te...@gmail.com>>
wrote:

Ahh... yes. That wiki. I will ping infra again.

(I was attaching your comment to the wikipedia use case and had confused
myself)

On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
michael.hausenblas@gmail.com<ma...@gmail.com>> wrote:

What do you need from me?

Maybe I've overlooked something in which case I apologize - was
wondering
if the public Wiki for Drill is available where Stefan, I and others
can
write up the UC and queries.

Cheers,
Michael

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

On 13 Jan 2013, at 14:20, Ted Dunning <te...@gmail.com>> wrote:

What do you need from me?

On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
michael.hausenblas@gmail.com<ma...@gmail.com>> wrote:

as soon as we hear back from Ted re the Wiki we work there.

Re: Introduction

Posted by Jacques Nadeau <ja...@gmail.com>.

Since we're going to support nested values/identifiers/references, an
additional option would be to skip the step of mapping to tables/columns
and query directly in nested format.  Since we're hoping to support maps
and arrays, it could be interesting to see how well that works.
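
To make the two options concrete, here is a minimal Python sketch that maps the same XML either to flat table rows or to one nested record. It is an illustration under stated assumptions, not Drill code: the XML sample is a made-up, heavily simplified stand-in for a page of the Wikipedia dump, and the names to_rows, to_nested, page_title and revision_id are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical, heavily simplified stand-in for one <page> of the Wikipedia
# XML dump; the real dump has many more elements.
SAMPLE = """
<page>
  <title>Apache Drill</title>
  <revision><id>1</id><contributor><username>ann</username></contributor></revision>
  <revision><id>2</id><contributor><username>bob</username></contributor></revision>
</page>
"""


def to_rows(page: ET.Element) -> List[Dict[str, object]]:
    """Table/column mapping: one flat row per revision, title repeated."""
    title = page.findtext("title")
    return [
        {
            "page_title": title,
            "revision_id": int(rev.findtext("id")),
            "username": rev.findtext("contributor/username"),
        }
        for rev in page.findall("revision")
    ]


def to_nested(page: ET.Element) -> Dict[str, object]:
    """Nested mapping: one record per page, revisions kept as an array of maps."""
    return {
        "title": page.findtext("title"),
        "revisions": [
            {
                "id": int(rev.findtext("id")),
                "contributor": {"username": rev.findtext("contributor/username")},
            }
            for rev in page.findall("revision")
        ],
    }


page = ET.fromstring(SAMPLE.strip())
rows = to_rows(page)      # what a SQL dump with tables and columns would hold
nested = to_nested(page)  # what a query "directly in nested format" would see
```

The flat mapping repeats the page title on every row and needs a join-like step to get back to the page; the nested mapping keeps the hierarchy, so a reference such as revisions[0].contributor.username can be read straight off the record.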

Jacques

On Sat, Jan 19, 2013 at 2:39 PM, Siprell, Stefan
<st...@exxeta.de>wrote:

> That is what I am doing, as the SQL dumps were too large. I was going to
> map the XML to tables and columns to generate the SQL.
>
> Stefan
>
> Sent from my iPad
>
> On 19.01.2013 at 23:21, "Jacques Nadeau" <ja...@gmail.com> wrote:
>
> > Stefan, one other thought.  It might also be interesting to explore
> working
> > with the XML representation of the Wikipedia data to push the nested data
> > requirements.
> >
> > Jacques
> >
> > On Sat, Jan 19, 2013 at 10:51 AM, Jacques Nadeau <
> jacques.drill@gmail.com>wrote:
> >
> >>
> >>> * I drew a UML diagram. I saw that there is some Gliffy support in
> >>> Confluence, but the free account is pretty much useless. I used
> >>> OmniGraffle to draw the diagram, but this is payware on the Mac - is there
> >>> some usable freeware alternative? Don't mention tigris :-)
> >>
> >> I don't have any suggestions on this.
> >>> * I have some ideas on the queries, but I am not sure how I should
> >>> specify them? Should I use pseudo SQL? Prose? I saw the syntax document on
> >>> the server. Is it mature enough that I should attempt to use its syntax? Is there
> >>> a BNF or better ANTLR grammar I can use to check my syntax? Should I draw
> >>> one up while I am at it?
> >>
> >> I suggest you target SQL2003 (including subqueries).  We're looking at
> how
> >> to use Optiq's SQL parser for Drill.  Our goal is to stay as close as
> >> possible to that spec but add the following extensions:
> >> - Add flatten operator similar to BigQuery syntax.
> >> - Support use of selection and output identifiers using dotted/bracketed
> >> notation.  E.g. "select person.children[0].age as
> >> output.profile.firstChildAge"
> >> - Support new functions that can accept nested values including
> >> collections and maps.  For example "select
> ARRAY_LENGTH(person.children)".
> >>
> >> Once you have some sql examples, the next goal would be to manually
> >> translate those into Logical Plan syntax.  This syntax is still
> maturing so
> >> I'd take it to the SQL stage first.
> >>
> >>
> >>
> >>>
> >>>
> >>>
> >>> Stefan
> >>>
> >>>
> >>>
> >>> On 19.01.2013, at 02:05, Jacques Nadeau <ja...@gmail.com>
> wrote:
> >>>
> >>>> The wiki is up.  Michael and Stefan, it would be great if you started
> >>>> putting your use case thoughts there.
> >>>>
> >>>> Jacques
> >>>>
> >>>> On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <te...@gmail.com>
> >>> wrote:
> >>>>
> >>>>> Ahh... yes.  That wiki.  I will ping infra again.
> >>>>>
> >>>>> (I was attaching your comment to the wikipedia use case and had
> >>> confused
> >>>>> myself)
> >>>>>
> >>>>> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
> >>>>> michael.hausenblas@gmail.com> wrote:
> >>>>>
> >>>>>>
> >>>>>>> What do you need from me?
> >>>>>>
> >>>>>> Maybe I've overlooked something in which case I apologize - was
> >>> wondering
> >>>>>> if the public Wiki for Drill is available where Stefan, I and others
> >>> can
> >>>>>> write up the UC and queries.
> >>>>>>
> >>>>>> Cheers,
> >>>>>>               Michael
> >>>>>>
> >>>>>> --
> >>>>>> Michael Hausenblas
> >>>>>> Ireland, Europe
> >>>>>> http://mhausenblas.info/
> >>>>>>
> >>>>>> On 13 Jan 2013, at 14:20, Ted Dunning <te...@gmail.com>
> wrote:
> >>>>>>
> >>>>>>> What do you need from me?
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
> >>>>>>> michael.hausenblas@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> as soon as we hear back from Ted re the Wiki we work there.
> >>
>

Re: Introduction

Posted by "Siprell, Stefan" <st...@exxeta.de>.

That is what I am doing, as the SQL dumps were too large. I was going to map the XML to tables and columns to generate the SQL.

Stefan

Sent from my iPad

On 19.01.2013 at 23:21, "Jacques Nadeau" <ja...@gmail.com> wrote:

> Stefan, one other thought.  It might also be interesting to explore working
> with the XML representation of the Wikipedia data to push the nested data
> requirements.
> 
> Jacques
> 
> On Sat, Jan 19, 2013 at 10:51 AM, Jacques Nadeau <ja...@gmail.com>wrote:
> 
>> 
>>> * I drew a UML diagram. I saw that there is some Gliffy support in
>>> Confluence, but the free account is pretty much useless. I used
>>> OmniGraffle to draw the diagram, but this is payware on the Mac - is there some
>>> usable freeware alternative? Don't mention tigris :-)
>> 
>> I don't have any suggestions on this.
>>> * I have some ideas on the queries, but I am not sure how I should
>>> specify them? Should I use pseudo SQL? Prose? I saw the syntax document on
>>> the server. Is it mature enough that I should attempt to use its syntax? Is there
>>> a BNF or better ANTLR grammar I can use to check my syntax? Should I draw
>>> one up while I am at it?
>> 
>> I suggest you target SQL2003 (including subqueries).  We're looking at how
>> to use Optiq's SQL parser for Drill.  Our goal is to stay as close as
>> possible to that spec but add the following extensions:
>> - Add flatten operator similar to BigQuery syntax.
>> - Support use of selection and output identifiers using dotted/bracketed
>> notation.  E.g. "select person.children[0].age as
>> output.profile.firstChildAge"
>> - Support new functions that can accept nested values including
>> collections and maps.  For example "select ARRAY_LENGTH(person.children)".
>> 
>> Once you have some sql examples, the next goal would be to manually
>> translate those into Logical Plan syntax.  This syntax is still maturing so
>> I'd take it to the SQL stage first.
>> 
>> 
>> 
>>> 
>>> 
>>> 
>>> Stefan
>>> 
>>> 
>>> 
>>> On 19.01.2013, at 02:05, Jacques Nadeau <ja...@gmail.com> wrote:
>>> 
>>>> The wiki is up.  Michael and Stefan, it would be great if you started
>>>> putting your use case thoughts there.
>>>> 
>>>> Jacques
>>>> 
>>>> On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>>> 
>>>>> Ahh... yes.  That wiki.  I will ping infra again.
>>>>> 
>>>>> (I was attaching your comment to the wikipedia use case and had
>>> confused
>>>>> myself)
>>>>> 
>>>>> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
>>>>> michael.hausenblas@gmail.com> wrote:
>>>>> 
>>>>>> 
>>>>>>> What do you need from me?
>>>>>> 
>>>>>> Maybe I've overlooked something in which case I apologize - was
>>> wondering
>>>>>> if the public Wiki for Drill is available where Stefan, I and others
>>> can
>>>>>> write up the UC and queries.
>>>>>> 
>>>>>> Cheers,
>>>>>>               Michael
>>>>>> 
>>>>>> --
>>>>>> Michael Hausenblas
>>>>>> Ireland, Europe
>>>>>> http://mhausenblas.info/
>>>>>> 
>>>>>> On 13 Jan 2013, at 14:20, Ted Dunning <te...@gmail.com> wrote:
>>>>>> 
>>>>>>> What do you need from me?
>>>>>>> 
>>>>>>> 
>>>>>>> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
>>>>>>> michael.hausenblas@gmail.com> wrote:
>>>>>>> 
>>>>>>>> as soon as we hear back from Ted re the Wiki we work there.
>>

Re: Introduction

Posted by Jacques Nadeau <ja...@gmail.com>.

Stefan, one other thought.  It might also be interesting to explore working
with the XML representation of the Wikipedia data to push the nested data
requirements.

Jacques

On Sat, Jan 19, 2013 at 10:51 AM, Jacques Nadeau <ja...@gmail.com>wrote:

>
>> * I drew a UML diagram. I saw that there is some Gliffy support in
>> Confluence, but the free account is pretty much useless. I used
>> OmniGraffle to draw the diagram, but this is payware on the Mac - is there some
>> usable freeware alternative? Don't mention tigris :-)
>>
>
> I don't have any suggestions on this.
>
>
>> * I have some ideas on the queries, but I am not sure how I should
>> specify them? Should I use pseudo SQL? Prose? I saw the syntax document on
>> the server. Is it mature enough that I should attempt to use its syntax? Is there
>> a BNF or better ANTLR grammar I can use to check my syntax? Should I draw
>> one up while I am at it?
>>
>
> I suggest you target SQL2003 (including subqueries).  We're looking at how
> to use Optiq's SQL parser for Drill.  Our goal is to stay as close as
> possible to that spec but add the following extensions:
> - Add flatten operator similar to BigQuery syntax.
> - Support use of selection and output identifiers using dotted/bracketed
> notation.  E.g. "select person.children[0].age as
> output.profile.firstChildAge"
> - Support new functions that can accept nested values including
> collections and maps.  For example "select ARRAY_LENGTH(person.children)".
>
> Once you have some sql examples, the next goal would be to manually
> translate those into Logical Plan syntax.  This syntax is still maturing so
> I'd take it to the SQL stage first.
>
>
>
>>
>>
>>
>> Stefan
>>
>>
>>
>> On 19.01.2013, at 02:05, Jacques Nadeau <ja...@gmail.com> wrote:
>>
>> > The wiki is up.  Michael and Stefan, it would be great if you started
>> > putting your use case thoughts there.
>> >
>> > Jacques
>> >
>> > On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> >
>> >> Ahh... yes.  That wiki.  I will ping infra again.
>> >>
>> >> (I was attaching your comment to the wikipedia use case and had
>> confused
>> >> myself)
>> >>
>> >> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
>> >> michael.hausenblas@gmail.com> wrote:
>> >>
>> >>>
>> >>>> What do you need from me?
>> >>>
>> >>> Maybe I've overlooked something in which case I apologize - was
>> wondering
>> >>> if the public Wiki for Drill is available where Stefan, I and others
>> can
>> >>> write up the UC and queries.
>> >>>
>> >>> Cheers,
>> >>>                Michael
>> >>>
>> >>> --
>> >>> Michael Hausenblas
>> >>> Ireland, Europe
>> >>> http://mhausenblas.info/
>> >>>
>> >>> On 13 Jan 2013, at 14:20, Ted Dunning <te...@gmail.com> wrote:
>> >>>
>> >>>> What do you need from me?
>> >>>>
>> >>>>
>> >>>> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
>> >>>> michael.hausenblas@gmail.com> wrote:
>> >>>>
>> >>>>> as soon as we hear back from Ted re the Wiki we work there.
>> >>>
>> >>>
>> >>
>>
>>
>

Re: Introduction

Posted by Jacques Nadeau <ja...@gmail.com>.

>
>
> * I drew a UML diagram. I saw that there is some Gliffy support in
> Confluence, but the free account is pretty much useless. I used
> OmniGraffle to draw the diagram, but this is payware on the Mac - is there some
> usable freeware alternative? Don't mention tigris :-)
>

I don't have any suggestions on this.


> * I have some ideas on the queries, but I am not sure how I should specify
> them? Should I use pseudo SQL? Prose? I saw the syntax document on the
> server. Is it mature enough that I should attempt to use its syntax? Is there a
> BNF or better ANTLR grammar I can use to check my syntax? Should I draw one
> up while I am at it?
>

I suggest you target SQL2003 (including subqueries).  We're looking at how
to use Optiq's SQL parser for Drill.  Our goal is to stay as close as
possible to that spec but add the following extensions:
- Add flatten operator similar to BigQuery syntax.
- Support use of selection and output identifiers using dotted/bracketed
notation.  E.g. "select person.children[0].age as
output.profile.firstChildAge"
- Support new functions that can accept nested values including collections
and maps.  For example "select ARRAY_LENGTH(person.children)".

Once you have some SQL examples, the next goal would be to manually
translate those into Logical Plan syntax.  This syntax is still maturing so
I'd take it to the SQL stage first.
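
For illustration only, here is a minimal Python sketch of what the three proposed extensions would mean when applied to an in-memory record. It is not Drill or Optiq code. The sample record and the helper names (parse_path, get_path, set_path, array_length, flatten) are assumptions made up for this example, and flatten assumes the BigQuery-style reading "one output row per element of a repeated field".

```python
import copy
import re
from typing import Any, Iterator, List, Union

# Hypothetical sample record; field names follow the examples in this mail.
record = {
    "person": {
        "name": "Ann",
        "children": [{"name": "Bob", "age": 7}, {"name": "Cy", "age": 4}],
    }
}

_PATH = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*|\[\d+\])*")
_STEP = re.compile(r"([A-Za-z_]\w*)|\[(\d+)\]")


def parse_path(path: str) -> List[Union[str, int]]:
    """Split 'person.children[0].age' into ['person', 'children', 0, 'age']."""
    if not _PATH.fullmatch(path):
        raise ValueError(f"not a dotted/bracketed path: {path!r}")
    return [name if name else int(index) for name, index in _STEP.findall(path)]


def get_path(value: Any, path: str) -> Any:
    """Follow map keys and array indexes from the root value."""
    for step in parse_path(path):
        value = value[step]
    return value


def set_path(target: dict, path: str, value: Any) -> None:
    """Write value under a dotted output identifier (names only), creating maps."""
    *parents, last = parse_path(path)
    for step in parents:
        target = target.setdefault(step, {})
    target[last] = value


def array_length(value: list) -> int:
    """Stand-in for the ARRAY_LENGTH function named in the mail."""
    return len(value)


def flatten(rec: dict, path: str) -> Iterator[dict]:
    """Yield one copy of rec per element of the repeated field at path."""
    *parents, last = parse_path(path)
    for element in get_path(rec, path):
        row = copy.deepcopy(rec)
        holder = row
        for step in parents:
            holder = holder[step]
        holder[last] = element  # the array is replaced by a single element
        yield row


# select person.children[0].age as output.profile.firstChildAge
result: dict = {}
set_path(result, "output.profile.firstChildAge",
         get_path(record, "person.children[0].age"))

# select ARRAY_LENGTH(person.children)
child_count = array_length(get_path(record, "person.children"))

# flatten on person.children: one row per child
rows = list(flatten(record, "person.children"))
```

The same path parser serves both the selection side and the output side, which is one possible reason to treat dotted/bracketed identifiers as a single extension rather than two.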



>
>
>
> Stefan
>
>
>
> On 19.01.2013, at 02:05, Jacques Nadeau <ja...@gmail.com> wrote:
>
> > The wiki is up.  Michael and Stefan, it would be great if you started
> > putting your use case thoughts there.
> >
> > Jacques
> >
> > On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <te...@gmail.com>
> wrote:
> >
> >> Ahh... yes.  That wiki.  I will ping infra again.
> >>
> >> (I was attaching your comment to the wikipedia use case and had confused
> >> myself)
> >>
> >> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
> >> michael.hausenblas@gmail.com> wrote:
> >>
> >>>
> >>>> What do you need from me?
> >>>
> >>> Maybe I've overlooked something in which case I apologize - was
> wondering
> >>> if the public Wiki for Drill is available where Stefan, I and others
> can
> >>> write up the UC and queries.
> >>>
> >>> Cheers,
> >>>                Michael
> >>>
> >>> --
> >>> Michael Hausenblas
> >>> Ireland, Europe
> >>> http://mhausenblas.info/
> >>>
> >>> On 13 Jan 2013, at 14:20, Ted Dunning <te...@gmail.com> wrote:
> >>>
> >>>> What do you need from me?
> >>>>
> >>>>
> >>>> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
> >>>> michael.hausenblas@gmail.com> wrote:
> >>>>
> >>>>> as soon as we hear back from Ted re the Wiki we work there.
> >>>
> >>>
> >>
>
>

Re: Introduction

Posted by Ted Dunning <te...@gmail.com>.

On Sat, Jan 19, 2013 at 1:30 PM, Siprell, Stefan
<st...@exxeta.de>wrote:

> ...
> * I drew a UML diagram. I saw that there is some Gliffy support in
> Confluence, but the free account is pretty much useless. I used
> OmniGraffle to draw the diagram, but this is payware on the Mac - is there some
> usable freeware alternative? Don't mention tigris :-)
>

Let's start with OmniGraffle since it is actually quite commonly used (at
least in our community).  If somebody else has a free tool that is up to
the task, we can switch.

I know that IntelliJ does a decent job extracting UML diagrams from the
code.

> * I have some ideas on the queries, but I am not sure how I should specify
> them? Should I use pseudo SQL? Prose? I saw the syntax document on the
> server. Is it mature enough that I should attempt to use its syntax? Is there a
> BNF or better ANTLR grammar I can use to check my syntax? Should I draw one
> up while I am at it?
>

I agree with the others that approximate SQL is a good choice here.

For XML style nesting of element within element, just pretend you have
ordinary SQL but add some nesting syntax to the field references.

The tricky part will come with repeated fields.

For now, the syntax document covers only the syntax of the internal
logical plan language.  This is the machine-generated entry point.  It
would be good to build these queries eventually for testing, but right now
any handy pseudocode is fine.
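
One way to picture why repeated fields are the tricky part is the Python sketch below. A nested field reference that passes through a repeated element no longer names a single value. The fan-out behaviour shown here is only one possible semantics and an assumption of this example, not something the thread has settled; the page record and the resolve helper are hypothetical.

```python
from typing import Any, List

# Hypothetical, heavily simplified page record, loosely modelled on the
# Wikipedia XML dump: a page with a title and a repeated revision element.
page = {
    "title": "Apache Drill",
    "revision": [
        {"id": 1, "contributor": {"username": "ann"}},
        {"id": 2, "contributor": {"username": "bob"}},
    ],
}


def resolve(value: Any, names: List[str]) -> List[Any]:
    """Return every value reachable via names, fanning out over repeated fields."""
    if isinstance(value, list):
        # Repeated field: apply the remaining path to each element.
        results: List[Any] = []
        for item in value:
            results.extend(resolve(item, names))
        return results
    if not names:
        return [value]
    if isinstance(value, dict) and names[0] in value:
        return resolve(value[names[0]], names[1:])
    return []  # missing field: no values


titles = resolve(page, ["title"])
usernames = resolve(page, "revision.contributor.username".split("."))
missing = resolve(page, ["revision", "comment"])
```

Here page.title resolves to exactly one value, while page.revision.contributor.username resolves to one value per revision, so a query language has to decide whether such a reference means an array, an implicit flatten, or an error.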

Re: Introduction

Posted by "Siprell, Stefan" <st...@exxeta.de>.

Hi,
I started work on the first page. I needed to get my hands on some demo data and wrap my head around things. I have not yet started the use cases, but I have prepared the groundwork. I do have some questions:

* I drew a UML diagram. I saw that there is some Gliffy support in Confluence, but the free account is pretty much useless. I used OmniGraffle to draw the diagram, but this is payware on the Mac - is there some usable freeware alternative? Don't mention tigris :-)
* I have some ideas on the queries, but I am not sure how I should specify them. Should I use pseudo SQL? Prose? I saw the syntax document on the server. Is it mature enough that I should attempt to use its syntax? Is there a BNF or, better, an ANTLR grammar I can use to check my syntax? Should I draw one up while I am at it?

Stefan

On 19.01.2013, at 02:05, Jacques Nadeau <ja...@gmail.com> wrote:

> The wiki is up.  Michael and Stefan, it would be great if you started
> putting your use case thoughts there.
> 
> Jacques
> 
> On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <te...@gmail.com> wrote:
> 
>> Ahh... yes.  That wiki.  I will ping infra again.
>> 
>> (I was attaching your comment to the wikipedia use case and had confused
>> myself)
>> 
>> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
>> michael.hausenblas@gmail.com> wrote:
>> 
>>> 
>>>> What do you need from me?
>>> 
>>> Maybe I've overlooked something in which case I apologize - was wondering
>>> if the public Wiki for Drill is available where Stefan, I and others can
>>> write up the UC and queries.
>>> 
>>> Cheers,
>>>                Michael
>>> 
>>> --
>>> Michael Hausenblas
>>> Ireland, Europe
>>> http://mhausenblas.info/
>>> 
>>> On 13 Jan 2013, at 14:20, Ted Dunning <te...@gmail.com> wrote:
>>> 
>>>> What do you need from me?
>>>> 
>>>> 
>>>> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
>>>> michael.hausenblas@gmail.com> wrote:
>>>> 
>>>>> as soon as we hear back from Ted re the Wiki we work there.
>>> 
>>> 
>>

Re: Introduction

Posted by Jacques Nadeau <ja...@gmail.com>.

The wiki is up.  Michael and Stefan, it would be great if you started
putting your use case thoughts there.

Jacques

On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <te...@gmail.com> wrote:

> Ahh... yes.  That wiki.  I will ping infra again.
>
> (I was attaching your comment to the wikipedia use case and had confused
> myself)
>
> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
> michael.hausenblas@gmail.com> wrote:
>
> >
> > > What do you need from me?
> >
> > Maybe I've overlooked something in which case I apologize - was wondering
> > if the public Wiki for Drill is available where Stefan, I and others can
> > write up the UC and queries.
> >
> > Cheers,
> >                 Michael
> >
> > --
> > Michael Hausenblas
> > Ireland, Europe
> > http://mhausenblas.info/
> >
> > On 13 Jan 2013, at 14:20, Ted Dunning <te...@gmail.com> wrote:
> >
> > > What do you need from me?
> > >
> > >
> > > On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
> > > michael.hausenblas@gmail.com> wrote:
> > >
> > >> as soon as we hear back from Ted re the Wiki we work there.
> >
> >
>

Re: Introduction

Posted by Ted Dunning <te...@gmail.com>.

Ahh... yes.  That wiki.  I will ping infra again.

(I was attaching your comment to the wikipedia use case and had confused
myself)

On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas <
michael.hausenblas@gmail.com> wrote:

>
> > What do you need from me?
>
> Maybe I've overlooked something in which case I apologize - was wondering
> if the public Wiki for Drill is available where Stefan, I and others can
> write up the UC and queries.
>
> Cheers,
>                 Michael
>
> --
> Michael Hausenblas
> Ireland, Europe
> http://mhausenblas.info/
>
> On 13 Jan 2013, at 14:20, Ted Dunning <te...@gmail.com> wrote:
>
> > What do you need from me?
> >
> >
> > On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
> > michael.hausenblas@gmail.com> wrote:
> >
> >> as soon as we hear back from Ted re the Wiki we work there.
>
>

Re: Introduction

Posted by Michael Hausenblas <mi...@gmail.com>.

> What do you need from me?

Maybe I've overlooked something in which case I apologize - was wondering if the public Wiki for Drill is available where Stefan, I and others can write up the UC and queries.

Cheers,
		Michael

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

On 13 Jan 2013, at 14:20, Ted Dunning <te...@gmail.com> wrote:

> What do you need from me?
> 
> 
> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
> michael.hausenblas@gmail.com> wrote:
> 
>> as soon as we hear back from Ted re the Wiki we work there.

Re: Introduction

Posted by Ted Dunning <te...@gmail.com>.

What do you need from me?


On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas <
michael.hausenblas@gmail.com> wrote:

> as soon as we hear back from Ted re the Wiki we work there.

Re: Introduction

Posted by Michael Hausenblas <mi...@gmail.com>.

Stefan,

> glad that I can help. May I suggest that I continue in the creation of use cases and the respective types of query profiles:
> * Wikipedia Edit History: At first glance, the history is made up of 40 or so tables. I would design some user stories using join-like queries across multiple tables - or whatever they are called in Drill.
> * Enron emails: I have not had an opportunity to check the Enron data yet, but here I would design user stories as if building an email client. This would lead to heavy use of full-text search.
> 
> There are some additional data-sets I would like to suggest: http://aws.amazon.com/datasets
> 
> * Freebase.com: User stories that simulate a visualization for jumping from topic to topic. This would lead to queries on a random and very small rowset.
> * Wikipedia Page Traffic Statistics: Simulate a log analysis. Heavy aggregation and date functions on a large number of rows.
> * Global Weather Measurements: Design user stories based on geographic and chronological aggregation of climate data to visualize trends.

That sounds great! I reckon that as soon as we hear back from Ted re the Wiki, we can work there. For the time being, let's continue the discussion here.

Cheers,
		Michael

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

On 11 Jan 2013, at 00:18, "Siprell, Stefan" <st...@exxeta.de> wrote:

> Hi,
> glad that I can help. May I suggest that I continue in the creation of use cases and the respective types of query profiles:
> * Wikipedia Edit History: At first glance, the history is made up of 40 or so tables. I would design some user stories using join-like queries across multiple tables - or whatever they are called in Drill.
> * Enron emails: I have not had an opportunity to check the Enron data yet, but here I would design user stories as if building an email client. This would lead to heavy use of full-text search.
> 
> There are some additional data-sets I would like to suggest: http://aws.amazon.com/datasets
> 
> * Freebase.com: User stories that simulate a visualization for jumping from topic to topic. This would lead to queries on a random and very small rowset.
> * Wikipedia Page Traffic Statistics: Simulate a log analysis. Heavy aggregation and date functions on a large number of rows.
> * Global Weather Measurements: Design user stories based on geographic and chronological aggregation of climate data to visualize trends.
> 
> 
> Regards
> Stefan
> 
> ________________________________________
> From: Michael Hausenblas [michael.hausenblas@gmail.com]
> Sent: Thursday, 10 January 2013 19:54
> To: drill-dev@incubator.apache.org
> Subject: Re: Introduction
> 
>> Michael Hausenblas is beginning to collect data sets and query examples for
>> different plausible use cases ranging from small to large.  He should show
>> up on the mailing list shortly and you could coordinate with him.
> 
> 
> Welcome, Stefan - great to have you on board!
> 
> So the idea would be to compile a list of datasets along with typical (interesting) queries formulated in natural language. One thing we need to get this off the ground is the Wiki but I gather Ted is on that ..
> 
> Datasets that might be of interest include, but are not restricted to:
> 
> * Wikipedia edit history from [1]
> * Census data (US, Eurostat, etc.)
> * AOL search logs
> * Enron emails [2]
> 
> Feel free to come up with additional ones as well.
> 
> I suppose we can continue the discussion (who looks into what) here on the list and once the Wiki is available we can co-ordinate also via it.
> 
> Cheers,
>                Michael
> 
> [1] http://en.wikipedia.org/wiki/Wikipedia:Database_download
> [2] http://www.cs.cmu.edu/~enron/
> 
> --
> Michael Hausenblas
> Ireland, Europe
> http://mhausenblas.info/
> 
> On 10 Jan 2013, at 10:19, Ted Dunning <te...@gmail.com> wrote:
> 
>> Stefan,
>> 
>> One of the key things to do right now is to work on use cases.
>> 
>> Michael Hausenblas is beginning to collect data sets and query examples for
>> different plausible use cases ranging from small to large.  He should show
>> up on the mailing list shortly and you could coordinate with him.
>> 
>> On Thu, Jan 10, 2013 at 5:45 AM, Siprell, Stefan
>> <st...@exxeta.de>wrote:
>> 
>>> Hi all,
>>> I am working for a IT consulting agency in Germany. One of the goals of
>>> our team for 2013 is active (as in giving) participation in the open source
>>> community and offering our customers cutting-edge analytical tools for
>>> large to huge data bases. You guys hit the spot!
>>> 
>>> I would like to start offering my personal help (volunteer work for now,
>>> later I could pitch in a day or two per week perhaps) in any role which
>>> would help. I am a somewhat strong enterprise java developer, can deal
>>> sufficiently well with HTML5 frontends, know most things about build
>>> environments and testing and should be able to do some design or
>>> documentation.
>>> 
>>> Is there anything I can do?
>>> 
>>> Stefan

Re: Introduction

Posted by "Siprell, Stefan" <st...@exxeta.de>.

Hi,
glad that I can help. May I suggest that I continue in the creation of use cases and the respective types of query profiles:
* Wikipedia Edit History: At first glance, the history is made up of 40 or so tables. I would design some user stories using join-like queries across multiple tables - or whatever they are called in Drill.
* Enron emails: I have not had an opportunity to check the Enron data yet, but here I would design user stories as if building an email client. This would lead to heavy use of full-text search.

There are some additional datasets I would like to suggest: http://aws.amazon.com/datasets

* Freebase.com: Simulate a visualization that jumps from topic to topic as user stories. This would lead to queries on random and very small row sets.
* Wikipedia Page Traffic Statistics: Simulate a log analysis. Heavy aggregation and date functions over a large number of rows.
* Global Weather Measurements: Design user stories based on geographic and chronological aggregation of climate data to visualize trends.

Regards
Stefan

________________________________________
From: Michael Hausenblas [michael.hausenblas@gmail.com]
Sent: Thursday, January 10, 2013 19:54
To: drill-dev@incubator.apache.org
Subject: Re: Introduction

> Michael Hausenblas is beginning to collect data sets and query examples for
> different plausible use cases ranging from small to large.  He should show
> up on the mailing list shortly and you could coordinate with him.

Welcome, Stefan - great to have you on board!

So the idea would be to compile a list of datasets along with typical (interesting) queries formulated in natural language. One thing we need to get this off the ground is the wiki, but I gather Ted is on that.

Datasets that might be of interest include, but are not restricted to:

 * Wikipedia edit history from [1]
 * Census data (US, Eurostat, etc.)
 * AOL search logs
 * Enron emails [2]

Feel free to come up with additional ones as well.

I suppose we can continue the discussion (who looks into what) here on the list, and once the wiki is available we can also co-ordinate via it.

Cheers,
                Michael

[1] http://en.wikipedia.org/wiki/Wikipedia:Database_download
[2] http://www.cs.cmu.edu/~enron/

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

Re: Introduction

Posted by Michael Hausenblas <mi...@gmail.com>.


> Michael Hausenblas is beginning to collect data sets and query examples for
> different plausible use cases ranging from small to large.  He should show
> up on the mailing list shortly and you could coordinate with him.


Welcome, Stefan - great to have you on board!

So the idea would be to compile a list of datasets along with typical (interesting) queries formulated in natural language. One thing we need to get this off the ground is the wiki, but I gather Ted is on that.

Datasets that might be of interest include, but are not restricted to:

 * Wikipedia edit history from [1]
 * Census data (US, Eurostat, etc.) 
 * AOL search logs 
 * Enron emails [2]

Feel free to come up with additional ones as well.

I suppose we can continue the discussion (who looks into what) here on the list, and once the wiki is available we can also co-ordinate via it.

Cheers,
		Michael

[1] http://en.wikipedia.org/wiki/Wikipedia:Database_download
[2] http://www.cs.cmu.edu/~enron/

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

Re: Introduction

Posted by Ted Dunning <te...@gmail.com>.

Can't wait to see what you get.

On Mon, Jan 14, 2013 at 11:24 AM, Jason <ja...@apache.org> wrote:

> +100 I agree and I'm looking into it :-)

Re: Introduction

Posted by Jason <ja...@apache.org>.

+100 I agree and I'm looking into it :-)


On Fri, Jan 11, 2013 at 2:29 PM, Ted Dunning <te...@gmail.com> wrote:

> Jason,
>
> One of the most interesting possibilities with Scala relative to Drill
> would be to have a Scala DSL that produces queries in the logical plan
> syntax.  This would allow what is essentially a native embedding of SQL
> capabilities in Scala.

Re: Introduction

Posted by Ted Dunning <te...@gmail.com>.

Jason,

One of the most interesting possibilities with Scala relative to Drill
would be to have a Scala DSL that produces queries in the logical plan
syntax.  This would allow what is essentially a native embedding of SQL
capabilities in Scala.

On Fri, Jan 11, 2013 at 9:58 AM, Jason <ja...@apache.org> wrote:

> I like medals :-)
>
> I also want to see this project flourish.
>
> Since we're on the topic of 'stuff': I'm trying to keep quiet (for now), but
> I've already been planning a lot of Scala work with Drill. Meaning, either
> a direct port of certain pieces (or all of it, using Scala to its fullest) or
> a language binding (or a combination of the two). I'm interested in seeing if
> Scala may allow playing around with Drill's architecture / concepts at an even
> higher order.
>
> - J

Re: Introduction

Posted by Jason <ja...@apache.org>.

I like medals :-)

I also want to see this project flourish.

Since we're on the topic of 'stuff': I'm trying to keep quiet (for now), but
I've already been planning a lot of Scala work with Drill. Meaning, either
a direct port of certain pieces (or all of it, using Scala to its fullest) or
a language binding (or a combination of the two). I'm interested in seeing if
Scala may allow playing around with Drill's architecture / concepts at an even
higher order.

- J


On Thu, Jan 10, 2013 at 2:07 PM, Ted Dunning <te...@gmail.com> wrote:

> Jason,
>
> That would be very helpful.  I am scheduled back to back for the rest of
> today and then on the road.
>
> You would be awarded a virtual hero medal.

Re: Introduction

Posted by Ted Dunning <te...@gmail.com>.

Jason,

That would be very helpful.  I am scheduled back to back for the rest of
today and then on the road.

You would be awarded a virtual hero medal.

On Thu, Jan 10, 2013 at 10:36 AM, Jason <ja...@apache.org> wrote:

> I can try to get a draft report put together (for 2012 Q4?) -J

Re: Introduction

Posted by Jason <ja...@apache.org>.

I can try to get a draft report put together (for 2012 Q4?) -J


On Thu, Jan 10, 2013 at 1:19 PM, Ted Dunning <te...@gmail.com> wrote:

> Stefan,
>
> One of the key things to do right now is to work on use cases.
>
> Michael Hausenblas is beginning to collect data sets and query examples for
> different plausible use cases ranging from small to large.  He should show
> up on the mailing list shortly and you could coordinate with him.

Re: Introduction

Posted by Ted Dunning <te...@gmail.com>.

Stefan,

One of the key things to do right now is to work on use cases.

Michael Hausenblas is beginning to collect data sets and query examples for
different plausible use cases ranging from small to large.  He should show
up on the mailing list shortly and you could coordinate with him.

On Thu, Jan 10, 2013 at 5:45 AM, Siprell, Stefan
<st...@exxeta.de> wrote:

> Hi all,
> I am working for an IT consulting agency in Germany. One of the goals of
> our team for 2013 is active (as in giving) participation in the open source
> community and offering our customers cutting-edge analytical tools for
> large to huge databases. You guys hit the spot!
>
> I would like to start offering my personal help (volunteer work for now,
> later I could pitch in a day or two per week perhaps) in any role which
> would help. I am a somewhat strong enterprise Java developer, can deal
> sufficiently well with HTML5 frontends, know most things about build
> environments and testing, and should be able to do some design or
> documentation.
>
> Is there anything I can do?
>
> Stefan
>

Re: Introduction

Posted by Ellen Friedman <b....@gmail.com>.

Good luck and see you later...

Ellen

On Mon, Jan 28, 2013 at 11:15 PM, Siprell, Stefan
<st...@exxeta.de> wrote:

> Hi Ellen,
> I am in a bit of a bind at the moment due to some personal things. I have to
> put my membership into hibernation for the next couple of weeks; I will
> check in later.
>
> Stefan

Re: Introduction

Posted by "Siprell, Stefan" <st...@exxeta.de>.

Hi Ellen,
I am in a bit of a bind at the moment due to some personal things. I have to put my membership into hibernation for the next couple of weeks; I will check in later.

Stefan

________________________________________
From: Ellen Friedman [b.ellen.friedman@gmail.com]
Sent: Sunday, January 27, 2013 07:46
To: drill-dev@incubator.apache.org
Subject: Re: Introduction

Stefan,

Welcome to Drill. I'm very interested in how the use cases will shape up,
so I'll be watching the wiki.  Good luck!

Ellen Friedman

Re: Introduction

Posted by Ellen Friedman <b....@gmail.com>.

Stefan,

Welcome to Drill. I'm very interested in how the use cases will shape up,
so I'll be watching the wiki.  Good luck!

Ellen Friedman

On Thu, Jan 10, 2013 at 5:45 AM, Siprell, Stefan
<st...@exxeta.de> wrote:

> Hi all,
> I am working for an IT consulting agency in Germany. One of the goals of
> our team for 2013 is active (as in giving) participation in the open source
> community and offering our customers cutting-edge analytical tools for
> large to huge databases. You guys hit the spot!
>
> I would like to start offering my personal help (volunteer work for now,
> later I could pitch in a day or two per week perhaps) in any role which
> would help. I am a somewhat strong enterprise Java developer, can deal
> sufficiently well with HTML5 frontends, know most things about build
> environments and testing, and should be able to do some design or
> documentation.
>
> Is there anything I can do?
>
> Stefan
>