Posted to dev@calcite.apache.org by Joel Pfaff <jo...@gmail.com> on 2017/10/04 14:26:32 UTC

Support of query for complex/nested data in calcite

Hello,

Sorry, I was not registered on the mailing list when Luis Fernando Kauer
sent his mail, so I cannot reply to him directly.
But we share a similar concern about the ability to implement more
precise/efficient queries on complex documents using SQL.

Currently in our system, managing complex types requires UNNEST, and there
is no way to express a transformation inside a nested document without
flattening it first. This flattening can be expensive when it has to be
followed by a GROUP BY clause to recreate a similar document with some of
the nested data filtered and/or aggregated
(min/max/avg/concatenation).
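
To make the concern concrete, here is a minimal plain-Java sketch (illustrative names only, not our actual system) of that flatten-then-regroup round-trip: unnest every (deptno, empid) pair, filter, then group by deptno to rebuild the documents. Note that a department whose collection is empty disappears entirely after the round-trip:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative model of the UNNEST + GROUP BY round-trip described above.
// Names (Dept, empids) are made up for the sketch.
public class FlattenRegroup {
  record Dept(int deptno, List<Integer> empids) {}

  static Map<Integer, List<Integer>> filteredByDept(List<Dept> depts) {
    return depts.stream()
        // UNNEST: one (deptno, empid) row per nested element
        .flatMap(d -> d.empids().stream().map(e -> new int[] {d.deptno(), e}))
        // WHERE empid > 2
        .filter(row -> row[1] > 2)
        // GROUP BY deptno: rebuild a document-like structure per department
        .collect(Collectors.groupingBy(row -> row[0],
            Collectors.mapping(row -> row[1], Collectors.toList())));
  }
}
```

Besides the cost of materializing every flattened row, the regroup is lossy: departments with no surviving rows vanish from the result.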

For similar purposes, Google BigQuery provides the UNNEST and ARRAY keywords
(see https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays);
by allowing implicit joins between simple columns and nested columns, it
lets computations stay local to the document.

In Impala, it is possible to write correlated subqueries that operate on a
repeated field inside a document (see
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_complex_types.html).
The syntax differs from BigQuery's, but it works.

In Presto, it is possible to apply lambda functions to maps/arrays to
transform the structure and filter it: there are filter and map_filter
functions for arrays and maps, respectively (see
https://prestodb.io/docs/current/functions/lambda.html#filter). So that is
yet another syntax.

From Calcite's code, I can see that a good part of the parsing support is
already here (MAP/ARRAY/MULTISET are already supported), and I am wondering
what would be missing from, or different to, the BigQuery syntax.

So I started to play, and I managed to run such a query end to end using a
test in JdbcTest:

@Test public void testUnnestArrayColumnWithFilter() {
  CalciteAssert.hr()
      .query("select d.\"deptno\",\n"
          + " array(select x.\"empid\" * 2 from UNNEST(d.\"employees\") AS x"
          + " WHERE x.\"empid\" > 2) AS doubled_empno"
          + " from \"hr\".\"depts\" as d")
      .returnsUnordered(
          "deptno=10; DOUBLED_EMPNO=[{200}, {300}]",
          "deptno=30; DOUBLED_EMPNO=[]",
          "deptno=40; DOUBLED_EMPNO=[{400}]");
}

SQL parsing, validation, and execution all appeared to work just fine,
which looks really nice.

I then used explainContains() on that query to get access to the plan, and
I got:
PLAN=
EnumerableCalc(expr#0..4=[{inputs}], deptno=[$t0], DOUBLED_EMPNO=[$t4])
  EnumerableCorrelate(correlation=[$cor0], joinType=[inner], requiredColumns=[{2}])
    EnumerableTableScan(table=[[hr, depts]])
    EnumerableCollect(field=[EXPR$0])
      EnumerableCalc(expr#0..4=[{inputs}], expr#5=[2], expr#6=[*($t0, $t5)], expr#7=[>($t0, $t5)], EXPR$0=[$t6], $condition=[$t7])
        EnumerableUncollect
          EnumerableCalc(expr#0=[{inputs}], expr#1=[$cor0], expr#2=[$t1.employees], employees=[$t2])
            EnumerableValues(tuples=[[{ 0 }]])

I wanted to check the plan Calcite generates for these queries, to validate
that the aggregations stay local to the document and do not trigger a full
flatten/group-by when that is not needed.
I am honestly struggling to understand what is going on under the hood. I
have the feeling the uncollect/calc/collect sequence stays local to every
row, but I would like confirmation.
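
For what it's worth, my plain-Java reading of the inner pipeline, assuming EnumerableCorrelate re-evaluates its right-hand side once per outer row, is roughly:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative-only reading of the plan's inner pipeline: for each outer
// row, uncollect the employees array, apply the calc (empid * 2 where
// empid > 2), and collect the results back into an array. No cross-row
// state is involved, so the work would stay local to each row.
public class CorrelateSketch {
  static List<Integer> doubledEmpno(List<Integer> employees) {
    return employees.stream()          // EnumerableUncollect
        .filter(empid -> empid > 2)    // $condition=[>($t0, $t5)]
        .map(empid -> empid * 2)       // EXPR$0=[*($t0, $t5)]
        .collect(Collectors.toList()); // EnumerableCollect
  }
}
```

This is only my interpretation of the plan shape, not a statement about what the Enumerable implementation actually does.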

Regards, Joel

PS:
I also tried to run, on the interpreter, an example adapted from the
BigQuery page. This is what I added in
core/src/test/java/org/apache/calcite/test/SqlValidatorTest.java (method
testUnnestArrayColumn):

/** Tests executing a nested subquery using an interpreter. */
@Test public void testInterpretNestedSubquery() throws Exception {
  SqlNode parse =
      planner.parse("WITH sequences AS"
          + " (SELECT ARRAY[0, 1, 1, 2, 3, 5] AS some_numbers)"
          + " SELECT some_numbers, ARRAY(SELECT x * 2"
          + " FROM UNNEST(some_numbers) AS x) AS doubled"
          + " FROM sequences");

  SqlNode validate = planner.validate(parse);
  RelNode convert = planner.rel(validate).rel;

  final Interpreter interpreter = new Interpreter(dataContext, convert);
  // I never reached this point:
  assertRows(interpreter, "[[0, 1, 1, 2, 3, 5],[0, 2, 2, 4, 6, 10]]");
}

This would strangely fail with "Column 'X' not found in any table". I don't
know if that's the expected behaviour.

Re: Support of query for complex/nested data in calcite

Posted by Julian Hyde <jh...@apache.org>.
There's also CROSS APPLY and OUTER APPLY; see
https://issues.apache.org/jira/browse/CALCITE-1472.


Re: Support of query for complex/nested data in calcite

Posted by Luis Fernando Kauer <lf...@yahoo.com.br.INVALID>.
I created:
[CALCITE-1999] UNNEST generates an NPE when the parameter is null
[CALCITE-2000] UNNEST a collection that has a field with nested data generates an Exception

The syntax you posted as the equivalent standard query did not work (wrong syntax):

  select d.name, e.empid, e.salary from hr.depts d left cross join unnest(d.employees) e

What I managed to get working for a left join is:

  select d.name, e.empid, e.salary from hr.depts d left join unnest(d.employees) e on true

The syntax below also worked, but it uses an inner join:

  select d.name, e.empid, e.salary from hr.depts d cross join unnest(d.employees) e
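
To illustrate the behavioural difference (a plain-Java sketch over in-memory lists, with made-up names, not Calcite code): the cross-join form drops a department whose collection is empty, while the left-join form keeps it with null employee columns:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the two join behaviours over an unnested
// collection. With an empty nested list, the cross-join version emits no
// row for that department; the left-join version emits one row whose
// employee columns are null.
public class UnnestJoins {
  static List<String> crossJoin(String dept, List<String> emps) {
    List<String> rows = new ArrayList<>();
    for (String e : emps) {
      rows.add(dept + "," + e);
    }
    return rows; // empty collection => the department disappears
  }

  static List<String> leftJoin(String dept, List<String> emps) {
    List<String> rows = crossJoin(dept, emps);
    if (rows.isEmpty()) {
      rows.add(dept + ",null"); // preserved, with null employee columns
    }
    return rows;
  }
}
```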

Regards,
Luis Fernando

Re: Support of query for complex/nested data in calcite

Posted by Julian Hyde <jh...@gmail.com>.
Regarding nulls and FlatList: can you please log a bug?

Regarding the Impala query

  select d.name, e.empid, e.salary from hr.depts d left join d.employees e

The equivalent standard query is

  select d.name, e.empid, e.salary from hr.depts d left cross join unnest(d.employees) e

I can see how Impala's syntactic sugar might be helpful, although frankly it's not much shorter. Feel free to log a JIRA case.

Julian




Re: Support of query for complex/nested data in calcite

Posted by Luis Fernando Kauer <lf...@yahoo.com.br.INVALID>.

Hi Joel.

Nice to hear someone else is also concerned about this.
I'm more concerned about nested structures than unnesting simple arrays. What I found so far:
- The nested collection must exist (can't be null) or it generates an NPE. In the "hr" schema, try creating a department with the employees collection as null. Unfortunately, this is a very common issue for me.
- The join type is "inner" by default, but since I have many empty/null nested collections, I often need "left join" instead.
- A nested structure collection inside another nested structure collection (nested more than one level) always generates errors when trying to unnest the first level. This is because Calcite uses FlatLists internally, and a nested collection field usually contains an array or List, which are not Comparable, so when it tries to add the value of the column with nested data into the FlatList it generates an error.
- UNNEST reads the value of the column passed as a parameter, with the nested data for the current record, and iterates through it. So the column must return the whole nested collection with all nested fields loaded. It would be nice if somehow it could select only the nested fields that are used in the query.
- It may not be standard, but I like Impala's syntax, where you can simply query:
    select d.name, e.empid, e.salary from hr.depts d left join d.employees e
  I think it is very simple to use and to understand.
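
For reference, the NPE is trivially reproduced in plain Java: iterating a null collection throws, so a null-safe unnest has to substitute an empty list first. A minimal sketch (hypothetical names, not Calcite's actual implementation):

```java
import java.util.Collections;
import java.util.List;

// Minimal illustration of the null-collection problem: a naive unnest
// iterates the list directly and throws NullPointerException when the
// nested collection is null; a null-safe variant substitutes an empty list.
public class NullSafeUnnest {
  static int naiveCount(List<?> nested) {
    int n = 0;
    for (Object o : nested) { // throws NPE when nested == null
      n++;
    }
    return n;
  }

  static int safeCount(List<?> nested) {
    List<?> list = nested == null ? Collections.emptyList() : nested;
    return naiveCount(list);
  }
}
```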
Best regards,

Luis Fernando
 
 