You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by Aman Sinha <am...@apache.org> on 2017/12/06 00:51:31 UTC
Re: [EXT] Re: Food for thought about intra-document operation

I have attached a work-in-progress design doc to https://issues.apache.
org/jira/browse/DRILL-5999.  Hope to finalize the design over the next week
or so.

-Aman

On Wed, Nov 29, 2017 at 9:36 AM, Aman Sinha <am...@apache.org> wrote:

> Damien,
> for the intra-document operations, it would be useful to add support for
> LATERAL joins (SQL standard), which in conjunction with UNNEST (or FLATTEN)
> should address the use case you have.  I have filed a JIRA for this:
> https://issues.apache.org/jira/browse/DRILL-5999.
>
> -Aman
>
> On Tue, Sep 26, 2017 at 12:04 AM, Damien Profeta <
> damien.profeta@amadeus.com> wrote:
>
>> Hello Aman,
>>
>> AsterixDb seems to follow the standard SQL with a few minor modifications
>> and add functions to ease aggregations (array_count, array_avg…)
>>
>> That would tend to confirm at least that the support of unnest is a good
>> idea to improve Drill.
>>
>> Best regards
>>
>> Damien
>>
>> **
>>
>> On 09/25/2017 07:53 PM, Aman Sinha wrote:
>>
>>> Damien,
>>> thanks for initiating the discussion..indeed this would be a very useful
>>> enhancement.  Currently, Drill provides repeated_contains()  for
>>> filtering
>>> and repeated_count() for count aggregates on arrays but not the general
>>> purpose intra-document operations that you need based on your example.
>>> I haven't gone through all the alternatives but in addition to what you
>>> have described,  you might also want to look at SQL++ (
>>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html) which has
>>> been
>>> adopted by AsterixDB and has syntax extensions to SQL for unstructured
>>> data.
>>>
>>> -Aman
>>>
>>> On Mon, Sep 25, 2017 at 6:10 AM, Damien Profeta <
>>> damien.profeta@amadeus.com>
>>> wrote:
>>>
>>> Hello,
>>>>
>>>> A few format handled by Drill enable to work with document, meaning
>>>> nested
>>>> and repeated structure instead of just tables. Json and Parquet are the
>>>> two
>>>> that come to my mind right now. Document modeling is a great way to
>>>> express
>>>> complex object and is used a lot in my company. Drill is able to handle
>>>> them but unfortunately, it cannot make much computation on it. By
>>>> computation I mean, filtering branches of the document, computing
>>>> statistics (avg, min, max) on part of the document … That would be very
>>>> useful as an analytic tools.
>>>>
>>>> _What can be done_
>>>>
>>>> The question then is how to express the computation we want to do on the
>>>> document. I have found multiple ways to handle that and I don't really
>>>> know
>>>> which one is the best hence the mail to expose what I have found to
>>>> initiate discussion, maybe.
>>>>
>>>> First, in we look back at the Dremel paper which is the base of the
>>>> parquet format and also one of the example for drill, dremel is adding
>>>> the
>>>> special keyword "WITHIN" to SQL to specify that the computation has to
>>>> be
>>>> done within a document. What is very powerful with this keyword is that
>>>> it
>>>> allows you to generate document and doesn't force you to flatten
>>>> everything. You can find exemple of it usage in the google successor of
>>>> Dremel: BigQuery and its documentation : https://cloud.google.com/bigqu
>>>> ery/docs/legacy-nested-repeated.
>>>>
>>>> But it seems that it was problematic for Google, because they now
>>>> propose
>>>> a SQL that seems to be compliant with SQL 2011 for Bigquery to handle
>>>> such
>>>> computation. I am not familiar with SQL 2011 but it is told in BigQuery
>>>> documentation to integrated the keywords for nested and repeated
>>>> structure.
>>>> You can have a view about how this is done in BigQuery here:
>>>> https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays .
>>>> Basically, what I have seen is that they leverage UNNEST and ARRAY
>>>> keyword
>>>> and then are able to use JOIN or CROSS JOIN to describe the aggregation.
>>>>
>>>> In Impala, they have added a way to add a subquery on a complex type in
>>>> such a way that the subquery only act intra-document. I have no idea if
>>>> this is standard SQL or not. In page https://www.cloudera.com/docum
>>>> entation/enterprise/5-5-x/topics/impala_complex_types.html#c
>>>> omplex_types
>>>> look at the phrase: “The subquery labelled SUBQ1 is correlated:” for
>>>> example.
>>>>
>>>> In Presto, you can apply lambda function to map/array to transform the
>>>> structure and apply filter on it. So you have filter, map_filter
>>>> function
>>>> to filter array and map respectively. (cf
>>>> https://prestodb.io/docs/curre
>>>> nt/functions/lambda.html#filter)
>>>>
>>>> _Example_
>>>>
>>>> If I want to make a short example, let’s say we have a flight with a
>>>> group
>>>> of passengers in it. A document would be :
>>>>
>>>> { “flightnb”:1234, “group”:[{“age”:30,”gender”:”M
>>>> ”},{“age”:15,”gender”:”F”},
>>>> {“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]}
>>>>
>>>> The database would be millions of such document and I want to know the
>>>> average age of the male passenger for every flight.
>>>>
>>>> In Dremel, the query would be something like: select flightnb,
>>>> avg(male_age) within record from (select groups.age as male_age from
>>>> flight
>>>> where group.gender = "M")
>>>>
>>>> With sql, it would be something like: select flightnb, avg(male_age)
>>>> from
>>>> (array(select g.age as male_age from unnest(group)as g where g.gender =
>>>> "M") as male_age)
>>>>
>>>> With impala it would be something like: select flightnb, avg(male) from
>>>> flight, select g.age from groups as g where g.gender = “M” as male
>>>>
>>>> With presto, it would be something like:  select flightnb, avg(male)
>>>> from
>>>> flight, filter(group,x->x.gender = "M")as male
>>>>
>>>> I am not sure at all about my SQL queries but it should give you a rough
>>>> idea about the different ways to express the inital query.
>>>>
>>>> So many different ways to express the same query… I would personally go
>>>> for the SQL way of expressing things to implement it in Drill,
>>>> especially
>>>> because calcite is already able to parse unnest, array, but that’s only
>>>> my
>>>> first thought.
>>>>
>>>> Best regards,
>>>>
>>>> Damien
>>>>
>>>>
>>>>
>>
>