You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Christian Tzolov <ct...@pivotal.io> on 2017/11/10 17:42:04 UTC

Project + Sort on single and on multiple columns

I've observed in my
no-sql adapter
 implementation that for q
ueries with
P
roject
 +
S
ort by
ONE
 column
t
he
Project

is pushed (as expected) 
before the Sort but for Sort
on MULTIPLE
 columns
the Sort is before the Project.
For example
for a query with one
sort column:

SELECT yearPublished FROM BookMaster ORDER BY yearPublished ASC

The plan looks like expected (project before the sort)


PLAN=GeodeToEnumerableConverterRel
  *GeodeSortRel*(sort0=[$0], dir0=[ASC])
       GeodeProjectRel(yearPublished=[$2])
           GeodeTableScanRel(table=[[TEST, BookMaster]])

But
 for sort
with
two

columns:

SELECT yearPublished, itemNumber from BookMaster ORDER BY yearPublished
ASC, itemNumber ASC

The
the plan is:


PLAN=GeodeToEnumerableConverterRel
  GeodeProjectRel(yearPublished=[$2], itemNumber=[$0])
    *GeodeSortRel*(sort0=[$2], sort1=[$0], dir0=[ASC], dir1=[ASC])
      GeodeTableScanRel(table=[[TEST, BookMaster]])

I'm not sure i can explain
 why in the second case the Sort appears before the Project?
Here
are my cost functions:

* 
GeodeSortRel
:
https://github.com/tzolov/calcite/blob/geode-1.3/geode/src/main/java/org/apache/calcite/adapter/geode/rel/GeodeSortRel.java#L51

* GoedeProjectRel:
https://github.com/tzolov/calcite/blob/4a631d9055340f64f5e644454551f964ea08f9de/geode/src/main/java/org/apache/calcite/adapter/geode/rel/GeodeProjectRel.java#L52



Cheers,
Christian

Re: Project + Sort on single and on multiple columns

Posted by Luis Fernando Kauer <lf...@yahoo.com.br.INVALID>.

There is something I don't get about Sort in Calcite.  Actually, it seems to happen already when parsing the SQL. 
Sort does not limit the projects, but it requires that the projects used for ordering are included in its input.
How can a Sort be the top node when there are projects used for ordering that are not in the select list? I thought there should always be a top project node selecting the projects to return when it needs to restrict or change from the previous node.  The problem is that the lack of a Project node on top restricting the projects will make adapters return more columns than are actually necessary.


Example 1:
>> SELECT empno FROM scott.emp ORDER BY ename

----  Logical Plan:
LogicalSort(sort0=[$1], dir0=[ASC])
  LogicalProject(EMPNO=[$0], ENAME=[$1])
    JdbcTableScan(table=[[SCOTT, EMP]])

----  Physical Plan:
 JdbcToEnumerableConverter
  JdbcProject(ENAME=[$1], EMPNO=[$0])
    JdbcSort(sort0=[$0], dir0=[ASC])
      JdbcTableScan(table=[[SCOTT, EMP]])


---- The generated SQL to query the database is:
 SELECT ENAME, EMPNOFROM (SELECT * FROM SCOTT.EMP ORDER BY EMPNO NULLS LAST) AS t


---------------------------

Example 1:
> SELECT ename FROM scott.emp ORDER BY empno limit 10

-----   Logical Plan:

  LogicalSort(sort0=[$1], dir0=[ASC], fetch=[10])
  LogicalProject(EMPNO=[$0], ENAME=[$1])
    JdbcTableScan(table=[[SCOTT, EMP]])


-----  Physical Plan:
 JdbcToEnumerableConverter
  JdbcProject(EMPNO=[$0], ENAME=[$1])
    JdbcSort(sort0=[$1], dir0=[ASC], fetch=[10])
      JdbcTableScan(table=[[SCOTT, EMP]])


---- The generated SQL to query the database is:

 SELECT EMPNO, ENAMEFROM (SELECT * FROM SCOTT.EMP ORDER BY ENAME NULLS LAST FETCH NEXT 10 ROWS ONLY) AS t0



I'm facing 3 main problems:
1. There's not a Project node that limits the projects above Sort even in the Logical Plan.
2. RelToSqlConverter does not work right with Sort
3. Sometimes JdbcProject is above JdbcSort and others is below.

Anybody can help me figure this out?
   Em sábado, 11 de novembro de 2017 09:35:40 BRST, Christian Tzolov <ct...@pivotal.io> escreveu:  
 
 @Julian,  this exactly what i would expect as logical behavior.

In fact it (consistently) behaves like this (e.g. Sort is after Project)
when the Sort is on a single column! But for some reasons if the Sort is
performed on two columns (same used in the Project) then the Sort is
performed before the Project!

This would be an expected if the Sort columns are not contained in the
Project such. So i wonder could it be that somewhere during the Sort
optimization the this column check gets confused?

On 11 November 2017 at 03:05, Julian Hyde <jh...@apache.org> wrote:

> While the cost of Project does not depend heavily on the number of input
> columns, the cost of Sort (or at least a typical Sort algorithm such as
> external merge sort) does depend on the number of columns (or more
> precisely on the average row size in bytes). So, if the Project reduces the
> number of columns (as most Projects do) then the Sort will have lower cost
> if performed after the Project, because it is handling fewer bytes.
>
>
> > On Nov 10, 2017, at 10:32 AM, Luis Fernando Kauer
> <lf...@yahoo.com.br.INVALID> wrote:
> >
> > I'm trying to fix https://issues.apache.org/jira/browse/CALCITE-1906
> and I'm facing a similar problem.
> > After managing to make JdbcSort to work, sometimes the JdbcProject is
> above JdbcSort and the generated SQL is wrong because RelToSqlConverter
> uses SqlImplementator.Clause enum to decide when to create subqueries, but
> since ORDER_BY is after SELECT, once it gets to JdbcProject it can't use
> the same query because it already used ORDER_BY.
> > The rule responsable for this is SortProjectTransposeRule.  The opposite
> rule is ProjectSortTransposeRule, but this one only matches if the sort
> node is exactly Sort.class, so it ends up not matching.
> > Is pushing the Project above Sort usually a good final plan or is it
> done to allow other rules to match?  If it is not, maybe we should solve
> this in the core project.
> >
> >
> >
> >    Em sexta-feira, 10 de novembro de 2017 15:57:34 BRST, Michael Mior <
> mmior@uwaterloo.ca> escreveu:
> >
> > Since the cost of the project doesn't depend on the number of columns
> being
> > projected or the size of the input, putting the project before or after
> the
> > sort will result in the same estimated cost. One approach would be to
> scale
> > the cost of the projection based on the fraction of columns projected.
> >
> > --
> > Michael Mior
> > mmior@apache.org
> >
> > 2017-11-10 12:42 GMT-05:00 Christian Tzolov <ct...@pivotal.io>:
> >
> >> I've observed in my
> >> no-sql adapter
> >>  implementation that for q
> >> ueries with
> >> P
> >> roject
> >>  +
> >> S
> >> ort by
> >> ONE
> >>  column
> >> t
> >> he
> >> Project
> >>
> >> is pushed (as expected) 
> >> before the Sort but for Sort
> >> on MULTIPLE
> >>  columns
> >> the Sort is before the Project.
> >> For example
> >> for a query with one
> >> sort column:
> >>
> >> SELECT yearPublished FROM BookMaster ORDER BY yearPublished ASC
> >>
> >> The plan looks like expected (project before the sort)
> >>
> >>
> >> PLAN=GeodeToEnumerableConverterRel
> >>  *GeodeSortRel*(sort0=[$0], dir0=[ASC])
> >>        GeodeProjectRel(yearPublished=[$2])
> >>            GeodeTableScanRel(table=[[TEST, BookMaster]])
> >>
> >> But
> >>  for sort
> >> with
> >> two
> >> 
> >> columns:
> >>
> >> SELECT yearPublished, itemNumber from BookMaster ORDER BY yearPublished
> >> ASC, itemNumber ASC
> >>
> >> The
> >> the plan is:
> >>
> >>
> >> PLAN=GeodeToEnumerableConverterRel
> >>  GeodeProjectRel(yearPublished=[$2], itemNumber=[$0])
> >>    *GeodeSortRel*(sort0=[$2], sort1=[$0], dir0=[ASC], dir1=[ASC])
> >>      GeodeTableScanRel(table=[[TEST, BookMaster]])
> >>
> >> I'm not sure i can explain
> >>  why in the second case the Sort appears before the Project?
> >> Here
> >> are my cost functions:
> >>
> >> * 
> >> GeodeSortRel
> >> :
> >> https://github.com/tzolov/calcite/blob/geode-1.3/geode/
> >> src/main/java/org/apache/calcite/adapter/geode/rel/
> GeodeSortRel.java#L51
> >>
> >> * GoedeProjectRel:
> >> https://github.com/tzolov/calcite/blob/4a631d9055340f64f5e644454551f9
> >> 64ea08f9de/geode/src/main/java/org/apache/calcite/adapter/geode/rel/
> >> GeodeProjectRel.java#L52
> >> 
> >>
> >> 
> >> Cheers,
> >> Christian
>
>


-- 
Christian Tzolov <http://www.linkedin.com/in/tzolov> | Principle Software
Engineer | Pivotal <http://pivotal.io/> | ctzolov@pivotal.io |+31610285517

Re: Project + Sort on single and on multiple columns

Posted by Christian Tzolov <ct...@pivotal.io>.

@Julian,  this exactly what i would expect as logical behavior.

In fact it (consistently) behaves like this (e.g. Sort is after Project)
when the Sort is on a single column! But for some reasons if the Sort is
performed on two columns (same used in the Project) then the Sort is
performed before the Project!

This would be an expected if the Sort columns are not contained in the
Project such. So i wonder could it be that somewhere during the Sort
optimization the this column check gets confused?

On 11 November 2017 at 03:05, Julian Hyde <jh...@apache.org> wrote:

> While the cost of Project does not depend heavily on the number of input
> columns, the cost of Sort (or at least a typical Sort algorithm such as
> external merge sort) does depend on the number of columns (or more
> precisely on the average row size in bytes). So, if the Project reduces the
> number of columns (as most Projects do) then the Sort will have lower cost
> if performed after the Project, because it is handling fewer bytes.
>
>
> > On Nov 10, 2017, at 10:32 AM, Luis Fernando Kauer
> <lf...@yahoo.com.br.INVALID> wrote:
> >
> > I'm trying to fix https://issues.apache.org/jira/browse/CALCITE-1906
> and I'm facing a similar problem.
> > After managing to make JdbcSort to work, sometimes the JdbcProject is
> above JdbcSort and the generated SQL is wrong because RelToSqlConverter
> uses SqlImplementator.Clause enum to decide when to create subqueries, but
> since ORDER_BY is after SELECT, once it gets to JdbcProject it can't use
> the same query because it already used ORDER_BY.
> > The rule responsable for this is SortProjectTransposeRule.  The opposite
> rule is ProjectSortTransposeRule, but this one only matches if the sort
> node is exactly Sort.class, so it ends up not matching.
> > Is pushing the Project above Sort usually a good final plan or is it
> done to allow other rules to match?  If it is not, maybe we should solve
> this in the core project.
> >
> >
> >
> >    Em sexta-feira, 10 de novembro de 2017 15:57:34 BRST, Michael Mior <
> mmior@uwaterloo.ca> escreveu:
> >
> > Since the cost of the project doesn't depend on the number of columns
> being
> > projected or the size of the input, putting the project before or after
> the
> > sort will result in the same estimated cost. One approach would be to
> scale
> > the cost of the projection based on the fraction of columns projected.
> >
> > --
> > Michael Mior
> > mmior@apache.org
> >
> > 2017-11-10 12:42 GMT-05:00 Christian Tzolov <ct...@pivotal.io>:
> >
> >> I've observed in my
> >> no-sql adapter
> >>  implementation that for q
> >> ueries with
> >> P
> >> roject
> >>  +
> >> S
> >> ort by
> >> ONE
> >>   column
> >> t
> >> he
> >> Project
> >>
> >> is pushed (as expected) 
> >> before the Sort but for Sort
> >> on MULTIPLE
> >>   columns
> >> the Sort is before the Project.
> >> For example
> >> for a query with one
> >> sort column:
> >>
> >> SELECT yearPublished FROM BookMaster ORDER BY yearPublished ASC
> >>
> >> The plan looks like expected (project before the sort)
> >>
> >>
> >> PLAN=GeodeToEnumerableConverterRel
> >>   *GeodeSortRel*(sort0=[$0], dir0=[ASC])
> >>         GeodeProjectRel(yearPublished=[$2])
> >>             GeodeTableScanRel(table=[[TEST, BookMaster]])
> >>
> >> But
> >>  for sort
> >> with
> >> two
> >> 
> >> columns:
> >>
> >> SELECT yearPublished, itemNumber from BookMaster ORDER BY yearPublished
> >> ASC, itemNumber ASC
> >>
> >> The
> >> the plan is:
> >>
> >>
> >> PLAN=GeodeToEnumerableConverterRel
> >>   GeodeProjectRel(yearPublished=[$2], itemNumber=[$0])
> >>     *GeodeSortRel*(sort0=[$2], sort1=[$0], dir0=[ASC], dir1=[ASC])
> >>       GeodeTableScanRel(table=[[TEST, BookMaster]])
> >>
> >> I'm not sure i can explain
> >>  why in the second case the Sort appears before the Project?
> >> Here
> >> are my cost functions:
> >>
> >> * 
> >> GeodeSortRel
> >> :
> >> https://github.com/tzolov/calcite/blob/geode-1.3/geode/
> >> src/main/java/org/apache/calcite/adapter/geode/rel/
> GeodeSortRel.java#L51
> >>
> >> * GoedeProjectRel:
> >> https://github.com/tzolov/calcite/blob/4a631d9055340f64f5e644454551f9
> >> 64ea08f9de/geode/src/main/java/org/apache/calcite/adapter/geode/rel/
> >> GeodeProjectRel.java#L52
> >> 
> >>
> >> 
> >> Cheers,
> >> Christian
>
>


-- 
Christian Tzolov <http://www.linkedin.com/in/tzolov> | Principle Software
Engineer | Pivotal <http://pivotal.io/> | ctzolov@pivotal.io |+31610285517

Re: Project + Sort on single and on multiple columns

Posted by Julian Hyde <jh...@apache.org>.

While the cost of Project does not depend heavily on the number of input columns, the cost of Sort (or at least a typical Sort algorithm such as external merge sort) does depend on the number of columns (or more precisely on the average row size in bytes). So, if the Project reduces the number of columns (as most Projects do) then the Sort will have lower cost if performed after the Project, because it is handling fewer bytes.


> On Nov 10, 2017, at 10:32 AM, Luis Fernando Kauer <lf...@yahoo.com.br.INVALID> wrote:
> 
> I'm trying to fix https://issues.apache.org/jira/browse/CALCITE-1906 and I'm facing a similar problem.
> After managing to make JdbcSort to work, sometimes the JdbcProject is above JdbcSort and the generated SQL is wrong because RelToSqlConverter uses SqlImplementator.Clause enum to decide when to create subqueries, but since ORDER_BY is after SELECT, once it gets to JdbcProject it can't use the same query because it already used ORDER_BY.
> The rule responsable for this is SortProjectTransposeRule.  The opposite rule is ProjectSortTransposeRule, but this one only matches if the sort node is exactly Sort.class, so it ends up not matching.
> Is pushing the Project above Sort usually a good final plan or is it done to allow other rules to match?  If it is not, maybe we should solve this in the core project.
> 
> 
> 
>    Em sexta-feira, 10 de novembro de 2017 15:57:34 BRST, Michael Mior <mm...@uwaterloo.ca> escreveu:  
> 
> Since the cost of the project doesn't depend on the number of columns being
> projected or the size of the input, putting the project before or after the
> sort will result in the same estimated cost. One approach would be to scale
> the cost of the projection based on the fraction of columns projected.
> 
> --
> Michael Mior
> mmior@apache.org
> 
> 2017-11-10 12:42 GMT-05:00 Christian Tzolov <ct...@pivotal.io>:
> 
>> I've observed in my
>> no-sql adapter
>>  implementation that for q
>> ueries with
>> P
>> roject
>>  +
>> S
>> ort by
>> ONE
>>   column
>> t
>> he
>> Project
>> 
>> is pushed (as expected) 
>> before the Sort but for Sort
>> on MULTIPLE
>>   columns
>> the Sort is before the Project.
>> For example
>> for a query with one
>> sort column:
>> 
>> SELECT yearPublished FROM BookMaster ORDER BY yearPublished ASC
>> 
>> The plan looks like expected (project before the sort)
>> 
>> 
>> PLAN=GeodeToEnumerableConverterRel
>>   *GeodeSortRel*(sort0=[$0], dir0=[ASC])
>>         GeodeProjectRel(yearPublished=[$2])
>>             GeodeTableScanRel(table=[[TEST, BookMaster]])
>> 
>> But
>>  for sort
>> with
>> two
>> 
>> columns:
>> 
>> SELECT yearPublished, itemNumber from BookMaster ORDER BY yearPublished
>> ASC, itemNumber ASC
>> 
>> The
>> the plan is:
>> 
>> 
>> PLAN=GeodeToEnumerableConverterRel
>>   GeodeProjectRel(yearPublished=[$2], itemNumber=[$0])
>>     *GeodeSortRel*(sort0=[$2], sort1=[$0], dir0=[ASC], dir1=[ASC])
>>       GeodeTableScanRel(table=[[TEST, BookMaster]])
>> 
>> I'm not sure i can explain
>>  why in the second case the Sort appears before the Project?
>> Here
>> are my cost functions:
>> 
>> * 
>> GeodeSortRel
>> :
>> https://github.com/tzolov/calcite/blob/geode-1.3/geode/
>> src/main/java/org/apache/calcite/adapter/geode/rel/GeodeSortRel.java#L51
>> 
>> * GoedeProjectRel:
>> https://github.com/tzolov/calcite/blob/4a631d9055340f64f5e644454551f9
>> 64ea08f9de/geode/src/main/java/org/apache/calcite/adapter/geode/rel/
>> GeodeProjectRel.java#L52
>> 
>> 
>> 
>> Cheers,
>> Christian

Re: Project + Sort on single and on multiple columns

Posted by Luis Fernando Kauer <lf...@yahoo.com.br.INVALID>.

 I'm trying to fix https://issues.apache.org/jira/browse/CALCITE-1906 and I'm facing a similar problem.
After managing to make JdbcSort to work, sometimes the JdbcProject is above JdbcSort and the generated SQL is wrong because RelToSqlConverter uses SqlImplementator.Clause enum to decide when to create subqueries, but since ORDER_BY is after SELECT, once it gets to JdbcProject it can't use the same query because it already used ORDER_BY.
The rule responsable for this is SortProjectTransposeRule.  The opposite rule is ProjectSortTransposeRule, but this one only matches if the sort node is exactly Sort.class, so it ends up not matching.
Is pushing the Project above Sort usually a good final plan or is it done to allow other rules to match?  If it is not, maybe we should solve this in the core project.



    Em sexta-feira, 10 de novembro de 2017 15:57:34 BRST, Michael Mior <mm...@uwaterloo.ca> escreveu:  
 
 Since the cost of the project doesn't depend on the number of columns being
projected or the size of the input, putting the project before or after the
sort will result in the same estimated cost. One approach would be to scale
the cost of the projection based on the fraction of columns projected.

--
Michael Mior
mmior@apache.org

2017-11-10 12:42 GMT-05:00 Christian Tzolov <ct...@pivotal.io>:

> I've observed in my
> no-sql adapter
>  implementation that for q
> ueries with
> P
> roject
>  +
> S
> ort by
> ONE
>  column
> t
> he
> Project
>
> is pushed (as expected) 
> before the Sort but for Sort
> on MULTIPLE
>  columns
> the Sort is before the Project.
> For example
> for a query with one
> sort column:
>
> SELECT yearPublished FROM BookMaster ORDER BY yearPublished ASC
>
> The plan looks like expected (project before the sort)
>
>
> PLAN=GeodeToEnumerableConverterRel
>  *GeodeSortRel*(sort0=[$0], dir0=[ASC])
>        GeodeProjectRel(yearPublished=[$2])
>            GeodeTableScanRel(table=[[TEST, BookMaster]])
>
> But
>  for sort
> with
> two
> 
> columns:
>
> SELECT yearPublished, itemNumber from BookMaster ORDER BY yearPublished
> ASC, itemNumber ASC
>
> The
> the plan is:
>
>
> PLAN=GeodeToEnumerableConverterRel
>  GeodeProjectRel(yearPublished=[$2], itemNumber=[$0])
>    *GeodeSortRel*(sort0=[$2], sort1=[$0], dir0=[ASC], dir1=[ASC])
>      GeodeTableScanRel(table=[[TEST, BookMaster]])
>
> I'm not sure i can explain
>  why in the second case the Sort appears before the Project?
> Here
> are my cost functions:
>
> * 
> GeodeSortRel
> :
> https://github.com/tzolov/calcite/blob/geode-1.3/geode/
> src/main/java/org/apache/calcite/adapter/geode/rel/GeodeSortRel.java#L51
>
> * GoedeProjectRel:
> https://github.com/tzolov/calcite/blob/4a631d9055340f64f5e644454551f9
> 64ea08f9de/geode/src/main/java/org/apache/calcite/adapter/geode/rel/
> GeodeProjectRel.java#L52
> 
>
> 
> Cheers,
> Christian
>

Re: Project + Sort on single and on multiple columns

Posted by Michael Mior <mm...@uwaterloo.ca>.

Yes, the project will reduce the amount of data required to be sorted, but
the cost model needs to reflect this. Although my mistake, Project already
implements this behaviour by multiplying the number of rows by the
estimated number of rows by the number of projected columns.

In short, not sure why you're experiencing this behaviour but maybe others
have some insights.

--
Michael Mior
mmior@apache.org

2017-11-10 13:07 GMT-05:00 Christian Tzolov <ct...@pivotal.io>:

> Not sure i understand. Lets say that we have rows of size 10MB each. IMO it
> is not the same if we perform the sort on the complete row (e.g. 10MB) or
> on subset left after the Project. E.g. my intuition is that the Project
> will shrink the size and therefore should be performed before the Sort (As
> it does for single column sort)
> Or is this intuition wrong?
>
>
> On 10 November 2017 at 18:57, Michael Mior <mm...@uwaterloo.ca> wrote:
>
> > Since the cost of the project doesn't depend on the number of columns
> being
> > projected or the size of the input, putting the project before or after
> the
> > sort will result in the same estimated cost. One approach would be to
> scale
> > the cost of the projection based on the fraction of columns projected.
> >
> > --
> > Michael Mior
> > mmior@apache.org
> >
> > 2017-11-10 12:42 GMT-05:00 Christian Tzolov <ct...@pivotal.io>:
> >
> > > I've observed in my
> > > no-sql adapter
> > >  implementation that for q
> > > ueries with
> > > P
> > > roject
> > >  +
> > > S
> > > ort by
> > > ONE
> > >  column
> > > t
> > > he
> > > Project
> > >
> > > is pushed (as expected) 
> > > before the Sort but for Sort
> > > on MULTIPLE
> > >  columns
> > > the Sort is before the Project.
> > > For example
> > > for a query with one
> > > sort column:
> > >
> > > SELECT yearPublished FROM BookMaster ORDER BY yearPublished ASC
> > >
> > > The plan looks like expected (project before the sort)
> > >
> > >
> > > PLAN=GeodeToEnumerableConverterRel
> > >   *GeodeSortRel*(sort0=[$0], dir0=[ASC])
> > >        GeodeProjectRel(yearPublished=[$2])
> > >            GeodeTableScanRel(table=[[TEST, BookMaster]])
> > >
> > > But
> > >  for sort
> > > with
> > > two
> > > 
> > > columns:
> > >
> > > SELECT yearPublished, itemNumber from BookMaster ORDER BY yearPublished
> > > ASC, itemNumber ASC
> > >
> > > The
> > > the plan is:
> > >
> > >
> > > PLAN=GeodeToEnumerableConverterRel
> > >   GeodeProjectRel(yearPublished=[$2], itemNumber=[$0])
> > >     *GeodeSortRel*(sort0=[$2], sort1=[$0], dir0=[ASC], dir1=[ASC])
> > >       GeodeTableScanRel(table=[[TEST, BookMaster]])
> > >
> > > I'm not sure i can explain
> > >  why in the second case the Sort appears before the Project?
> > > Here
> > > are my cost functions:
> > >
> > > * 
> > > GeodeSortRel
> > > :
> > > https://github.com/tzolov/calcite/blob/geode-1.3/geode/
> > > src/main/java/org/apache/calcite/adapter/geode/rel/
> GeodeSortRel.java#L51
> > >
> > > * GoedeProjectRel:
> > > https://github.com/tzolov/calcite/blob/4a631d9055340f64f5e644454551f9
> > > 64ea08f9de/geode/src/main/java/org/apache/calcite/adapter/geode/rel/
> > > GeodeProjectRel.java#L52
> > > 
> > >
> > > 
> > > Cheers,
> > > Christian
> > >
> >
>
>
>
> --
> Christian Tzolov <http://www.linkedin.com/in/tzolov> | Principle Software
> Engineer | Pivotal <http://pivotal.io/> | ctzolov@pivotal.io |+31610285517
>

Re: Project + Sort on single and on multiple columns

Posted by Christian Tzolov <ct...@pivotal.io>.

Not sure i understand. Lets say that we have rows of size 10MB each. IMO it
is not the same if we perform the sort on the complete row (e.g. 10MB) or
on subset left after the Project. E.g. my intuition is that the Project
will shrink the size and therefore should be performed before the Sort (As
it does for single column sort)
Or is this intuition wrong?


On 10 November 2017 at 18:57, Michael Mior <mm...@uwaterloo.ca> wrote:

> Since the cost of the project doesn't depend on the number of columns being
> projected or the size of the input, putting the project before or after the
> sort will result in the same estimated cost. One approach would be to scale
> the cost of the projection based on the fraction of columns projected.
>
> --
> Michael Mior
> mmior@apache.org
>
> 2017-11-10 12:42 GMT-05:00 Christian Tzolov <ct...@pivotal.io>:
>
> > I've observed in my
> > no-sql adapter
> >  implementation that for q
> > ueries with
> > P
> > roject
> >  +
> > S
> > ort by
> > ONE
> >  column
> > t
> > he
> > Project
> >
> > is pushed (as expected) 
> > before the Sort but for Sort
> > on MULTIPLE
> >  columns
> > the Sort is before the Project.
> > For example
> > for a query with one
> > sort column:
> >
> > SELECT yearPublished FROM BookMaster ORDER BY yearPublished ASC
> >
> > The plan looks like expected (project before the sort)
> >
> >
> > PLAN=GeodeToEnumerableConverterRel
> >   *GeodeSortRel*(sort0=[$0], dir0=[ASC])
> >        GeodeProjectRel(yearPublished=[$2])
> >            GeodeTableScanRel(table=[[TEST, BookMaster]])
> >
> > But
> >  for sort
> > with
> > two
> > 
> > columns:
> >
> > SELECT yearPublished, itemNumber from BookMaster ORDER BY yearPublished
> > ASC, itemNumber ASC
> >
> > The
> > the plan is:
> >
> >
> > PLAN=GeodeToEnumerableConverterRel
> >   GeodeProjectRel(yearPublished=[$2], itemNumber=[$0])
> >     *GeodeSortRel*(sort0=[$2], sort1=[$0], dir0=[ASC], dir1=[ASC])
> >       GeodeTableScanRel(table=[[TEST, BookMaster]])
> >
> > I'm not sure i can explain
> >  why in the second case the Sort appears before the Project?
> > Here
> > are my cost functions:
> >
> > * 
> > GeodeSortRel
> > :
> > https://github.com/tzolov/calcite/blob/geode-1.3/geode/
> > src/main/java/org/apache/calcite/adapter/geode/rel/GeodeSortRel.java#L51
> >
> > * GoedeProjectRel:
> > https://github.com/tzolov/calcite/blob/4a631d9055340f64f5e644454551f9
> > 64ea08f9de/geode/src/main/java/org/apache/calcite/adapter/geode/rel/
> > GeodeProjectRel.java#L52
> > 
> >
> > 
> > Cheers,
> > Christian
> >
>



-- 
Christian Tzolov <http://www.linkedin.com/in/tzolov> | Principle Software
Engineer | Pivotal <http://pivotal.io/> | ctzolov@pivotal.io |+31610285517

Re: Project + Sort on single and on multiple columns

Posted by Michael Mior <mm...@uwaterloo.ca>.

Since the cost of the project doesn't depend on the number of columns being
projected or the size of the input, putting the project before or after the
sort will result in the same estimated cost. One approach would be to scale
the cost of the projection based on the fraction of columns projected.

--
Michael Mior
mmior@apache.org

2017-11-10 12:42 GMT-05:00 Christian Tzolov <ct...@pivotal.io>:

> I've observed in my
> no-sql adapter
>  implementation that for q
> ueries with
> P
> roject
>  +
> S
> ort by
> ONE
>  column
> t
> he
> Project
>
> is pushed (as expected) 
> before the Sort but for Sort
> on MULTIPLE
>  columns
> the Sort is before the Project.
> For example
> for a query with one
> sort column:
>
> SELECT yearPublished FROM BookMaster ORDER BY yearPublished ASC
>
> The plan looks like expected (project before the sort)
>
>
> PLAN=GeodeToEnumerableConverterRel
>   *GeodeSortRel*(sort0=[$0], dir0=[ASC])
>        GeodeProjectRel(yearPublished=[$2])
>            GeodeTableScanRel(table=[[TEST, BookMaster]])
>
> But
>  for sort
> with
> two
> 
> columns:
>
> SELECT yearPublished, itemNumber from BookMaster ORDER BY yearPublished
> ASC, itemNumber ASC
>
> The
> the plan is:
>
>
> PLAN=GeodeToEnumerableConverterRel
>   GeodeProjectRel(yearPublished=[$2], itemNumber=[$0])
>     *GeodeSortRel*(sort0=[$2], sort1=[$0], dir0=[ASC], dir1=[ASC])
>       GeodeTableScanRel(table=[[TEST, BookMaster]])
>
> I'm not sure i can explain
>  why in the second case the Sort appears before the Project?
> Here
> are my cost functions:
>
> * 
> GeodeSortRel
> :
> https://github.com/tzolov/calcite/blob/geode-1.3/geode/
> src/main/java/org/apache/calcite/adapter/geode/rel/GeodeSortRel.java#L51
>
> * GoedeProjectRel:
> https://github.com/tzolov/calcite/blob/4a631d9055340f64f5e644454551f9
> 64ea08f9de/geode/src/main/java/org/apache/calcite/adapter/geode/rel/
> GeodeProjectRel.java#L52
> 
>
> 
> Cheers,
> Christian
>