Posted to issues@drill.apache.org by "Jacques Nadeau (JIRA)" <ji...@apache.org> on 2015/10/07 22:37:26 UTC

[jira] [Created] (DRILL-3910) Leverage Calcite's Clustered Collation

Jacques Nadeau created DRILL-3910:
-------------------------------------

             Summary: Leverage Calcite's Clustered Collation
                 Key: DRILL-3910
                 URL: https://issues.apache.org/jira/browse/DRILL-3910
             Project: Apache Drill
          Issue Type: Improvement
          Components: Query Planning & Optimization
            Reporter: Jacques Nadeau


Right now the streaming aggregate requires full collation, i.e. input that is fully sorted on the grouping keys. I was just talking to [~julianhyde], and he pointed out that Calcite has a version of collation that is clustered (similar to what MSSQL calls Segment): rows with equal key values are contiguous, but the groups themselves need not appear in sorted order. Realistically, the streaming aggregate only requires a clustered collation, and we should switch to requiring this.

We should also go through the existing operators and make sure we track whether or not each operator maintains a clustered collation. We should then be able to have flatten produce clustered output on the carry-through fields. This will allow us to do a better job of taking advantage of the clusteredness of data when doing additional operations.

Flatten should also produce data that exposes the distribution trait on the carry-through fields. This means that a query like this:

select a, count(b) from (
  select a, flatten(x) as b from t
)x
group by a

should be executed without redistribution of data.
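
To illustrate why clustering is enough, here is a minimal Python sketch of the query above. It is not Drill or Calcite code: the names flatten, streaming_count and the sample table t are hypothetical, and it assumes each row of t carries a distinct value of a with a non-empty list x. Under that assumption flatten emits all rows for one value of a contiguously, so a single-pass aggregate with constant state can group them even though the keys are not sorted.

```python
_NO_KEY = object()  # sentinel meaning "no group started yet"


def flatten(rows):
    """Hypothetical flatten: emit one (a, b) row per element of the list x.

    Every output row produced from one input row carries the same
    carry-through value a, so the output is clustered on a whenever
    each input row has a distinct a.
    """
    for a, xs in rows:
        for b in xs:
            yield a, b


def streaming_count(rows):
    """Single-pass count(b) grouped by a.

    Only assumes the input is clustered on a (equal keys are adjacent);
    it never compares keys for order, only for equality.
    """
    current, count = _NO_KEY, 0
    for a, b in rows:
        if current is not _NO_KEY and a != current:
            yield current, count  # key changed: the previous group is done
            count = 0
        current = a
        if b is not None:  # count(b) ignores NULLs
            count += 1
    if current is not _NO_KEY:
        yield current, count


# Keys arrive as m, c, z: clustered after flatten, but not sorted.
t = [("m", [1, 2]), ("c", [3]), ("z", [4, 5, 6])]
result = list(streaming_count(flatten(t)))

# Input that is NOT clustered on a produces a split group for "a".
unclustered = list(streaming_count([("a", 1), ("b", 1), ("a", 1)]))
```

The second call shows what the clustered requirement protects against: when equal keys are not adjacent, the same key is emitted as two separate groups, which is why the planner has to know whether an operator preserves clustering.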



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)