You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by Abdel Hakim Deneche <ad...@maprtech.com> on 2015/01/11 06:57:37 UTC
Re: how should window function works

I've been working on a new Window operator (DRILL-1908
<https://issues.apache.org/jira/browse/DRILL-1908>), I decided to start
from scratch because I had to extend from AbstractRecordBatch instead of
AbstractSingleRecordBatch, but I did reuse lot's of code from the existing
window record batch.

DRILL-1908 comments contain lot's of informations about the design and the
work I've been doing. The new operator supports over clauses with "order
by" clauses, it also supports over clauses WITHOUT "order by" clauses,
although there is a bug (DRILL-1852
<https://issues.apache.org/jira/browse/DRILL-1852>) in optiq-drill that
prevents such queries to be planed.

I added some unit tests, using the new test framework, but I still need to
add more tests to make sure edge cases run fine.

The code as it is requires some cleaning and refactoring, and I still need
to support "frame clauses
<http://www.postgresql.org/docs/9.1/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS>",
but I designed the operator with that in mind so technically I just need to
rewrite one function to get frame clauses supported.

On Thu, Dec 11, 2014 at 3:52 PM, Abdel Hakim Deneche <ad...@maprtech.com>
wrote:

> as part of DRILL-1487 <https://issues.apache.org/jira/browse/DRILL-1487>,
> I did some research about sql window functions. I am posting here what I've
> found so far, to make sure I am getting everything right, and to discuss
> how we can fix the current implementation.
>
> a window function performs a calculation across a set of table rows that
> are somehow related to the current row. It's similar to aggregate
> functions, but doesn't cause rows to be grouped into a single output row.
> The over clause determines exactly how the rows of the query are split up
> for processing by the window function.
>
> For each row, there is a set of rows within it's partition called it's
> "window frame". Many (but not all) window functions act only on the rows of
> the window frame, rather than of the whole partition. All aggregate
> functions act on the rows of the window frame.
>
> the over clause can contain a "partition by" clause, an "order by" clause
> and a frame clause. When "partition by" is omitted there is one single
> partition that contains all rows. If "order by" is supplied then the frame
> consists of all rows from the start of the partition up through the current
> row, plus any following rows that are equal to the current row according to
> the "order by" clause. When "order by" is omitted, the default frame
> consists of all rows in the partition. The frame clause specifies the set
> of rows constituting the window frame.
>
> DRILL-1487 deals with a specific case: which rows belong to the window
> frame when an "order by" clause is available. But there are several other
> problems related to window functions in Drill:
>
> 1/ omitting "partition by", "order by" or both throws an exception.
>
> 2/ using a different field in "partition by" and "order by" doesn't
> produce the expected results. For example, the following query should
> display the rows sorted by position_id and salary, but the result are only
> sorted by position_id:
>
> SELECT employee_id, position_id, salary, AVG(salary) OVER(PARTITION BY
> position_id ORDER BY salary) FROM cp.`employee.json` LIMIT 20;
>
> 3/ frame clauses are ignored, for example the following query doesn't work
> as expected ("ROWS CURRENT ROW" has no effect):
>
> SELECT employee_id, position_id, salary, AVG(salary) OVER(PARTITION BY
> position_id ORDER BY position_id ROWS CURRENT ROW) FROM cp.`employee.json`
> LIMIT 20;
>
> 4/ starting from Drill 0.7.0 window functions throw an exception (see
> DRILL-1844 <https://issues.apache.org/jira/browse/DRILL-1844>)
>
>