You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by Damien Profeta <da...@amadeus.com> on 2017/09/25 13:10:22 UTC

Food for thought about intra-document operation

Hello,

A few format handled by Drill enable to work with document, meaning
nested and repeated structure instead of just tables. Json and Parquet
are the two that come to my mind right now. Document modeling is a great
way to express complex object and is used a lot in my company. Drill is
able to handle them but unfortunately, it cannot make much computation
on it. By computation I mean, filtering branches of the document,
computing statistics (avg, min, max) on part of the document … That
would be very useful as an analytic tools.

_What can be done_

The question then is how to express the computation we want to do on the
document. I have found multiple ways to handle that and I don't really
know which one is the best hence the mail to expose what I have found to
initiate discussion, maybe.

First, in we look back at the Dremel paper which is the base of the
parquet format and also one of the example for drill, dremel is adding
the special keyword "WITHIN" to SQL to specify that the computation has
to be done within a document. What is very powerful with this keyword is
that it allows you to generate document and doesn't force you to flatten
everything. You can find exemple of it usage in the google successor of
Dremel: BigQuery and its documentation :
https://cloud.google.com/bigquery/docs/legacy-nested-repeated.

But it seems that it was problematic for Google, because they now
propose a SQL that seems to be compliant with SQL 2011 for Bigquery to
handle such computation. I am not familiar with SQL 2011 but it is told
in BigQuery documentation to integrated the keywords for nested and
repeated structure. You can have a view about how this is done in
BigQuery here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays .
Basically, what I have seen is that they leverage UNNEST and ARRAY
keyword and then are able to use JOIN or CROSS JOIN to describe the
aggregation.

In Impala, they have added a way to add a subquery on a complex type in
such a way that the subquery only act intra-document. I have no idea if
this is standard SQL or not. In page
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types
look at the phrase: “The subquery labelled SUBQ1 is correlated:” for
example.

In Presto, you can apply lambda function to map/array to transform the
structure and apply filter on it. So you have filter, map_filter
function to filter array and map respectively. (cf
https://prestodb.io/docs/current/functions/lambda.html#filter)

_Example_

If I want to make a short example, let’s say we have a flight with a
group of passengers in it. A document would be :

{ “flightnb”:1234,
“group”:[{“age”:30,”gender”:”M”},{“age”:15,”gender”:”F”},
{“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]}

The database would be millions of such document and I want to know the
average age of the male passenger for every flight.

In Dremel, the query would be something like: select flightnb,
avg(male_age) within record from (select groups.age as male_age from
flight where group.gender = "M")

With sql, it would be something like: select flightnb, avg(male_age)
from (array(select g.age as male_age from unnest(group)as g where
g.gender = "M") as male_age)

With impala it would be something like: select flightnb, avg(male) from
flight, select g.age from groups as g where g.gender = “M” as male

With presto, it would be something like: select flightnb, avg(male)
from flight, filter(group,x->x.gender = "M")as male

I am not sure at all about my SQL queries but it should give you a rough
idea about the different ways to express the inital query.

So many different ways to express the same query… I would personally go
for the SQL way of expressing things to implement it in Drill,
especially because calcite is already able to parse unnest, array, but
that’s only my first thought.

Best regards,

Damien

Re: [EXT] Re: Food for thought about intra-document operation

Posted by Aman Sinha <am...@apache.org>.

I have attached a work-in-progress design doc to https://issues.apache.
org/jira/browse/DRILL-5999.  Hope to finalize the design over the next week
or so.

-Aman

On Wed, Nov 29, 2017 at 9:36 AM, Aman Sinha <am...@apache.org> wrote:

> Damien,
> for the intra-document operations, it would be useful to add support for
> LATERAL joins (SQL standard), which in conjunction with UNNEST (or FLATTEN)
> should address the use case you have.  I have filed a JIRA for this:
> https://issues.apache.org/jira/browse/DRILL-5999.
>
> -Aman
>
> On Tue, Sep 26, 2017 at 12:04 AM, Damien Profeta <
> damien.profeta@amadeus.com> wrote:
>
>> Hello Aman,
>>
>> AsterixDb seems to follow the standard SQL with a few minor modifications
>> and add functions to ease aggregations (array_count, array_avg…)
>>
>> That would tend to confirm at least that the support of unnest is a good
>> idea to improve Drill.
>>
>> Best regards
>>
>> Damien
>>
>> **
>>
>> On 09/25/2017 07:53 PM, Aman Sinha wrote:
>>
>>> Damien,
>>> thanks for initiating the discussion..indeed this would be a very useful
>>> enhancement.  Currently, Drill provides repeated_contains()  for
>>> filtering
>>> and repeated_count() for count aggregates on arrays but not the general
>>> purpose intra-document operations that you need based on your example.
>>> I haven't gone through all the alternatives but in addition to what you
>>> have described,  you might also want to look at SQL++ (
>>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html) which has
>>> been
>>> adopted by AsterixDB and has syntax extensions to SQL for unstructured
>>> data.
>>>
>>> -Aman
>>>
>>> On Mon, Sep 25, 2017 at 6:10 AM, Damien Profeta <
>>> damien.profeta@amadeus.com>
>>> wrote:
>>>
>>> Hello,
>>>>
>>>> A few format handled by Drill enable to work with document, meaning
>>>> nested
>>>> and repeated structure instead of just tables. Json and Parquet are the
>>>> two
>>>> that come to my mind right now. Document modeling is a great way to
>>>> express
>>>> complex object and is used a lot in my company. Drill is able to handle
>>>> them but unfortunately, it cannot make much computation on it. By
>>>> computation I mean, filtering branches of the document, computing
>>>> statistics (avg, min, max) on part of the document … That would be very
>>>> useful as an analytic tools.
>>>>
>>>> _What can be done_
>>>>
>>>> The question then is how to express the computation we want to do on the
>>>> document. I have found multiple ways to handle that and I don't really
>>>> know
>>>> which one is the best hence the mail to expose what I have found to
>>>> initiate discussion, maybe.
>>>>
>>>> First, in we look back at the Dremel paper which is the base of the
>>>> parquet format and also one of the example for drill, dremel is adding
>>>> the
>>>> special keyword "WITHIN" to SQL to specify that the computation has to
>>>> be
>>>> done within a document. What is very powerful with this keyword is that
>>>> it
>>>> allows you to generate document and doesn't force you to flatten
>>>> everything. You can find exemple of it usage in the google successor of
>>>> Dremel: BigQuery and its documentation : https://cloud.google.com/bigqu
>>>> ery/docs/legacy-nested-repeated.
>>>>
>>>> But it seems that it was problematic for Google, because they now
>>>> propose
>>>> a SQL that seems to be compliant with SQL 2011 for Bigquery to handle
>>>> such
>>>> computation. I am not familiar with SQL 2011 but it is told in BigQuery
>>>> documentation to integrated the keywords for nested and repeated
>>>> structure.
>>>> You can have a view about how this is done in BigQuery here:
>>>> https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays .
>>>> Basically, what I have seen is that they leverage UNNEST and ARRAY
>>>> keyword
>>>> and then are able to use JOIN or CROSS JOIN to describe the aggregation.
>>>>
>>>> In Impala, they have added a way to add a subquery on a complex type in
>>>> such a way that the subquery only act intra-document. I have no idea if
>>>> this is standard SQL or not. In page https://www.cloudera.com/docum
>>>> entation/enterprise/5-5-x/topics/impala_complex_types.html#c
>>>> omplex_types
>>>> look at the phrase: “The subquery labelled SUBQ1 is correlated:” for
>>>> example.
>>>>
>>>> In Presto, you can apply lambda function to map/array to transform the
>>>> structure and apply filter on it. So you have filter, map_filter
>>>> function
>>>> to filter array and map respectively. (cf
>>>> https://prestodb.io/docs/curre
>>>> nt/functions/lambda.html#filter)
>>>>
>>>> _Example_
>>>>
>>>> If I want to make a short example, let’s say we have a flight with a
>>>> group
>>>> of passengers in it. A document would be :
>>>>
>>>> { “flightnb”:1234, “group”:[{“age”:30,”gender”:”M
>>>> ”},{“age”:15,”gender”:”F”},
>>>> {“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]}
>>>>
>>>> The database would be millions of such document and I want to know the
>>>> average age of the male passenger for every flight.
>>>>
>>>> In Dremel, the query would be something like: select flightnb,
>>>> avg(male_age) within record from (select groups.age as male_age from
>>>> flight
>>>> where group.gender = "M")
>>>>
>>>> With sql, it would be something like: select flightnb, avg(male_age)
>>>> from
>>>> (array(select g.age as male_age from unnest(group)as g where g.gender =
>>>> "M") as male_age)
>>>>
>>>> With impala it would be something like: select flightnb, avg(male) from
>>>> flight, select g.age from groups as g where g.gender = “M” as male
>>>>
>>>> With presto, it would be something like:  select flightnb, avg(male)
>>>> from
>>>> flight, filter(group,x->x.gender = "M")as male
>>>>
>>>> I am not sure at all about my SQL queries but it should give you a rough
>>>> idea about the different ways to express the inital query.
>>>>
>>>> So many different ways to express the same query… I would personally go
>>>> for the SQL way of expressing things to implement it in Drill,
>>>> especially
>>>> because calcite is already able to parse unnest, array, but that’s only
>>>> my
>>>> first thought.
>>>>
>>>> Best regards,
>>>>
>>>> Damien
>>>>
>>>>
>>>>
>>
>

Re: [EXT] Re: Food for thought about intra-document operation

Posted by Aman Sinha <am...@apache.org>.

Damien,
for the intra-document operations, it would be useful to add support for
LATERAL joins (SQL standard), which in conjunction with UNNEST (or FLATTEN)
should address the use case you have.  I have filed a JIRA for this:
https://issues.apache.org/jira/browse/DRILL-5999.

-Aman

On Tue, Sep 26, 2017 at 12:04 AM, Damien Profeta <damien.profeta@amadeus.com
> wrote:

> Hello Aman,
>
> AsterixDb seems to follow the standard SQL with a few minor modifications
> and add functions to ease aggregations (array_count, array_avg…)
>
> That would tend to confirm at least that the support of unnest is a good
> idea to improve Drill.
>
> Best regards
>
> Damien
>
> **
>
> On 09/25/2017 07:53 PM, Aman Sinha wrote:
>
>> Damien,
>> thanks for initiating the discussion..indeed this would be a very useful
>> enhancement.  Currently, Drill provides repeated_contains()  for filtering
>> and repeated_count() for count aggregates on arrays but not the general
>> purpose intra-document operations that you need based on your example.
>> I haven't gone through all the alternatives but in addition to what you
>> have described,  you might also want to look at SQL++ (
>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html) which has
>> been
>> adopted by AsterixDB and has syntax extensions to SQL for unstructured
>> data.
>>
>> -Aman
>>
>> On Mon, Sep 25, 2017 at 6:10 AM, Damien Profeta <
>> damien.profeta@amadeus.com>
>> wrote:
>>
>> Hello,
>>>
>>> A few format handled by Drill enable to work with document, meaning
>>> nested
>>> and repeated structure instead of just tables. Json and Parquet are the
>>> two
>>> that come to my mind right now. Document modeling is a great way to
>>> express
>>> complex object and is used a lot in my company. Drill is able to handle
>>> them but unfortunately, it cannot make much computation on it. By
>>> computation I mean, filtering branches of the document, computing
>>> statistics (avg, min, max) on part of the document … That would be very
>>> useful as an analytic tools.
>>>
>>> _What can be done_
>>>
>>> The question then is how to express the computation we want to do on the
>>> document. I have found multiple ways to handle that and I don't really
>>> know
>>> which one is the best hence the mail to expose what I have found to
>>> initiate discussion, maybe.
>>>
>>> First, in we look back at the Dremel paper which is the base of the
>>> parquet format and also one of the example for drill, dremel is adding
>>> the
>>> special keyword "WITHIN" to SQL to specify that the computation has to be
>>> done within a document. What is very powerful with this keyword is that
>>> it
>>> allows you to generate document and doesn't force you to flatten
>>> everything. You can find exemple of it usage in the google successor of
>>> Dremel: BigQuery and its documentation : https://cloud.google.com/bigqu
>>> ery/docs/legacy-nested-repeated.
>>>
>>> But it seems that it was problematic for Google, because they now propose
>>> a SQL that seems to be compliant with SQL 2011 for Bigquery to handle
>>> such
>>> computation. I am not familiar with SQL 2011 but it is told in BigQuery
>>> documentation to integrated the keywords for nested and repeated
>>> structure.
>>> You can have a view about how this is done in BigQuery here:
>>> https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays .
>>> Basically, what I have seen is that they leverage UNNEST and ARRAY
>>> keyword
>>> and then are able to use JOIN or CROSS JOIN to describe the aggregation.
>>>
>>> In Impala, they have added a way to add a subquery on a complex type in
>>> such a way that the subquery only act intra-document. I have no idea if
>>> this is standard SQL or not. In page https://www.cloudera.com/docum
>>> entation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types
>>> look at the phrase: “The subquery labelled SUBQ1 is correlated:” for
>>> example.
>>>
>>> In Presto, you can apply lambda function to map/array to transform the
>>> structure and apply filter on it. So you have filter, map_filter function
>>> to filter array and map respectively. (cf https://prestodb.io/docs/curre
>>> nt/functions/lambda.html#filter)
>>>
>>> _Example_
>>>
>>> If I want to make a short example, let’s say we have a flight with a
>>> group
>>> of passengers in it. A document would be :
>>>
>>> { “flightnb”:1234, “group”:[{“age”:30,”gender”:”M
>>> ”},{“age”:15,”gender”:”F”},
>>> {“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]}
>>>
>>> The database would be millions of such document and I want to know the
>>> average age of the male passenger for every flight.
>>>
>>> In Dremel, the query would be something like: select flightnb,
>>> avg(male_age) within record from (select groups.age as male_age from
>>> flight
>>> where group.gender = "M")
>>>
>>> With sql, it would be something like: select flightnb, avg(male_age) from
>>> (array(select g.age as male_age from unnest(group)as g where g.gender =
>>> "M") as male_age)
>>>
>>> With impala it would be something like: select flightnb, avg(male) from
>>> flight, select g.age from groups as g where g.gender = “M” as male
>>>
>>> With presto, it would be something like:  select flightnb, avg(male) from
>>> flight, filter(group,x->x.gender = "M")as male
>>>
>>> I am not sure at all about my SQL queries but it should give you a rough
>>> idea about the different ways to express the inital query.
>>>
>>> So many different ways to express the same query… I would personally go
>>> for the SQL way of expressing things to implement it in Drill, especially
>>> because calcite is already able to parse unnest, array, but that’s only
>>> my
>>> first thought.
>>>
>>> Best regards,
>>>
>>> Damien
>>>
>>>
>>>
>

Re: [EXT] Re: Food for thought about intra-document operation

Posted by Damien Profeta <da...@amadeus.com>.

Hello Aman,

AsterixDb seems to follow the standard SQL with a few minor 
modifications and add functions to ease aggregations (array_count, 
array_avg…)

That would tend to confirm at least that the support of unnest is a good 
idea to improve Drill.

Best regards

Damien

**
On 09/25/2017 07:53 PM, Aman Sinha wrote:
> Damien,
> thanks for initiating the discussion..indeed this would be a very useful
> enhancement.  Currently, Drill provides repeated_contains()  for filtering
> and repeated_count() for count aggregates on arrays but not the general
> purpose intra-document operations that you need based on your example.
> I haven't gone through all the alternatives but in addition to what you
> have described,  you might also want to look at SQL++ (
> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html) which has been
> adopted by AsterixDB and has syntax extensions to SQL for unstructured
> data.
>
> -Aman
>
> On Mon, Sep 25, 2017 at 6:10 AM, Damien Profeta <da...@amadeus.com>
> wrote:
>
>> Hello,
>>
>> A few format handled by Drill enable to work with document, meaning nested
>> and repeated structure instead of just tables. Json and Parquet are the two
>> that come to my mind right now. Document modeling is a great way to express
>> complex object and is used a lot in my company. Drill is able to handle
>> them but unfortunately, it cannot make much computation on it. By
>> computation I mean, filtering branches of the document, computing
>> statistics (avg, min, max) on part of the document … That would be very
>> useful as an analytic tools.
>>
>> _What can be done_
>>
>> The question then is how to express the computation we want to do on the
>> document. I have found multiple ways to handle that and I don't really know
>> which one is the best hence the mail to expose what I have found to
>> initiate discussion, maybe.
>>
>> First, in we look back at the Dremel paper which is the base of the
>> parquet format and also one of the example for drill, dremel is adding the
>> special keyword "WITHIN" to SQL to specify that the computation has to be
>> done within a document. What is very powerful with this keyword is that it
>> allows you to generate document and doesn't force you to flatten
>> everything. You can find exemple of it usage in the google successor of
>> Dremel: BigQuery and its documentation : https://cloud.google.com/bigqu
>> ery/docs/legacy-nested-repeated.
>>
>> But it seems that it was problematic for Google, because they now propose
>> a SQL that seems to be compliant with SQL 2011 for Bigquery to handle such
>> computation. I am not familiar with SQL 2011 but it is told in BigQuery
>> documentation to integrated the keywords for nested and repeated structure.
>> You can have a view about how this is done in BigQuery here:
>> https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays .
>> Basically, what I have seen is that they leverage UNNEST and ARRAY keyword
>> and then are able to use JOIN or CROSS JOIN to describe the aggregation.
>>
>> In Impala, they have added a way to add a subquery on a complex type in
>> such a way that the subquery only act intra-document. I have no idea if
>> this is standard SQL or not. In page https://www.cloudera.com/docum
>> entation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types
>> look at the phrase: “The subquery labelled SUBQ1 is correlated:” for
>> example.
>>
>> In Presto, you can apply lambda function to map/array to transform the
>> structure and apply filter on it. So you have filter, map_filter function
>> to filter array and map respectively. (cf https://prestodb.io/docs/curre
>> nt/functions/lambda.html#filter)
>>
>> _Example_
>>
>> If I want to make a short example, let’s say we have a flight with a group
>> of passengers in it. A document would be :
>>
>> { “flightnb”:1234, “group”:[{“age”:30,”gender”:”M”},{“age”:15,”gender”:”F”},
>> {“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]}
>>
>> The database would be millions of such document and I want to know the
>> average age of the male passenger for every flight.
>>
>> In Dremel, the query would be something like: select flightnb,
>> avg(male_age) within record from (select groups.age as male_age from flight
>> where group.gender = "M")
>>
>> With sql, it would be something like: select flightnb, avg(male_age) from
>> (array(select g.age as male_age from unnest(group)as g where g.gender =
>> "M") as male_age)
>>
>> With impala it would be something like: select flightnb, avg(male) from
>> flight, select g.age from groups as g where g.gender = “M” as male
>>
>> With presto, it would be something like:  select flightnb, avg(male) from
>> flight, filter(group,x->x.gender = "M")as male
>>
>> I am not sure at all about my SQL queries but it should give you a rough
>> idea about the different ways to express the inital query.
>>
>> So many different ways to express the same query… I would personally go
>> for the SQL way of expressing things to implement it in Drill, especially
>> because calcite is already able to parse unnest, array, but that’s only my
>> first thought.
>>
>> Best regards,
>>
>> Damien
>>
>>

Re: Food for thought about intra-document operation

Posted by Aman Sinha <am...@apache.org>.

Damien,
thanks for initiating the discussion..indeed this would be a very useful
enhancement.  Currently, Drill provides repeated_contains()  for filtering
and repeated_count() for count aggregates on arrays but not the general
purpose intra-document operations that you need based on your example.
I haven't gone through all the alternatives but in addition to what you
have described,  you might also want to look at SQL++ (
https://ci.apache.org/projects/asterixdb/sqlpp/manual.html) which has been
adopted by AsterixDB and has syntax extensions to SQL for unstructured
data.

-Aman

On Mon, Sep 25, 2017 at 6:10 AM, Damien Profeta <da...@amadeus.com>
wrote:

> Hello,
>
> A few format handled by Drill enable to work with document, meaning nested
> and repeated structure instead of just tables. Json and Parquet are the two
> that come to my mind right now. Document modeling is a great way to express
> complex object and is used a lot in my company. Drill is able to handle
> them but unfortunately, it cannot make much computation on it. By
> computation I mean, filtering branches of the document, computing
> statistics (avg, min, max) on part of the document … That would be very
> useful as an analytic tools.
>
> _What can be done_
>
> The question then is how to express the computation we want to do on the
> document. I have found multiple ways to handle that and I don't really know
> which one is the best hence the mail to expose what I have found to
> initiate discussion, maybe.
>
> First, in we look back at the Dremel paper which is the base of the
> parquet format and also one of the example for drill, dremel is adding the
> special keyword "WITHIN" to SQL to specify that the computation has to be
> done within a document. What is very powerful with this keyword is that it
> allows you to generate document and doesn't force you to flatten
> everything. You can find exemple of it usage in the google successor of
> Dremel: BigQuery and its documentation : https://cloud.google.com/bigqu
> ery/docs/legacy-nested-repeated.
>
> But it seems that it was problematic for Google, because they now propose
> a SQL that seems to be compliant with SQL 2011 for Bigquery to handle such
> computation. I am not familiar with SQL 2011 but it is told in BigQuery
> documentation to integrated the keywords for nested and repeated structure.
> You can have a view about how this is done in BigQuery here:
> https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays .
> Basically, what I have seen is that they leverage UNNEST and ARRAY keyword
> and then are able to use JOIN or CROSS JOIN to describe the aggregation.
>
> In Impala, they have added a way to add a subquery on a complex type in
> such a way that the subquery only act intra-document. I have no idea if
> this is standard SQL or not. In page https://www.cloudera.com/docum
> entation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types
> look at the phrase: “The subquery labelled SUBQ1 is correlated:” for
> example.
>
> In Presto, you can apply lambda function to map/array to transform the
> structure and apply filter on it. So you have filter, map_filter function
> to filter array and map respectively. (cf https://prestodb.io/docs/curre
> nt/functions/lambda.html#filter)
>
> _Example_
>
> If I want to make a short example, let’s say we have a flight with a group
> of passengers in it. A document would be :
>
> { “flightnb”:1234, “group”:[{“age”:30,”gender”:”M”},{“age”:15,”gender”:”F”},
> {“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]}
>
> The database would be millions of such document and I want to know the
> average age of the male passenger for every flight.
>
> In Dremel, the query would be something like: select flightnb,
> avg(male_age) within record from (select groups.age as male_age from flight
> where group.gender = "M")
>
> With sql, it would be something like: select flightnb, avg(male_age) from
> (array(select g.age as male_age from unnest(group)as g where g.gender =
> "M") as male_age)
>
> With impala it would be something like: select flightnb, avg(male) from
> flight, select g.age from groups as g where g.gender = “M” as male
>
> With presto, it would be something like:  select flightnb, avg(male) from
> flight, filter(group,x->x.gender = "M")as male
>
> I am not sure at all about my SQL queries but it should give you a rough
> idea about the different ways to express the inital query.
>
> So many different ways to express the same query… I would personally go
> for the SQL way of expressing things to implement it in Drill, especially
> because calcite is already able to parse unnest, array, but that’s only my
> first thought.
>
> Best regards,
>
> Damien
>
>