Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2016/12/21 06:15:58 UTC

[jira] [Comment Edited] (DRILL-5142) TestWindowFrame.testUnboundedFollowing relies on side effects

    [ https://issues.apache.org/jira/browse/DRILL-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15766250#comment-15766250 ] 

Paul Rogers edited comment on DRILL-5142 at 12/21/16 6:15 AM:
--------------------------------------------------------------

Tracked down the fundamental problem to Hadoop's QuickSort implementation used by the external sort: {{org.apache.hadoop.util.QuickSort}}. QuickSort itself is unstable: it does not guarantee to retain the relative order of items that compare equal.
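
To illustrate what "unstable" means here, below is a minimal sketch. It is not Hadoop's QuickSort; it uses a selection sort with swaps as a deliberately simple unstable sort, and compares it with {{Arrays.sort}} on an object array, which the JDK documents as stable. The class name, helper names and the three sample rows are made up for the example. Each row is (position_id, employee_id) and both sorts compare position_id only.

```java
import java.util.Arrays;
import java.util.Comparator;

final class SortStabilityDemo {

  /** Selection sort with swaps: a simple, deliberately unstable in-place sort. */
  static void unstableSortByPosition(int[][] rows) {
    for (int i = 0; i < rows.length - 1; i++) {
      int min = i;
      for (int j = i + 1; j < rows.length; j++) {
        if (rows[j][0] < rows[min][0]) {
          min = j;
        }
      }
      // This long-distance swap can move a row past other rows with an equal key.
      int[] tmp = rows[i];
      rows[i] = rows[min];
      rows[min] = tmp;
    }
  }

  /** Arrays.sort on an object array is documented to be stable. */
  static void stableSortByPosition(int[][] rows) {
    Arrays.sort(rows, Comparator.comparingInt((int[] row) -> row[0]));
  }

  /** Renders rows as "position_id,employee_id" pairs separated by spaces. */
  static String format(int[][] rows) {
    StringBuilder sb = new StringBuilder();
    for (int[] row : rows) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(row[0]).append(',').append(row[1]);
    }
    return sb.toString();
  }

  /** Hypothetical input: employee_id ascending, one row of position_id 1 last. */
  static int[][] input() {
    return new int[][] {{2, 0}, {2, 1}, {1, 2}};
  }

  public static void main(String[] args) {
    int[][] a = input();
    unstableSortByPosition(a);
    System.out.println("unstable: " + format(a)); // 1,2 2,1 2,0

    int[][] b = input();
    stableSortByPosition(b);
    System.out.println("stable:   " + format(b)); // 1,2 2,0 2,1
  }
}
```

Both outputs are correctly sorted by position_id. Only the stable sort keeps the two position_id = 2 rows in their incoming employee_id order, which is the kind of ordering the test was implicitly relying on.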

The unit tests were forcing the issue by reducing the output batch size of the first sort, which forced the second sort to merge multiple batches. The merge apparently is stable.
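
For reference, a merge of already-sorted batches is stable as long as ties are resolved in favor of the earlier batch. The sketch below shows that rule for two batches. It is only an illustration of the general technique, not Drill's merge code; the class name, method names and sample rows are invented. Rows are again (position_id, employee_id), compared on position_id only.

```java
final class StableMergeDemo {

  /** Merges two batches that are each sorted by position_id (column 0). */
  static int[][] mergeByPosition(int[][] left, int[][] right) {
    int[][] out = new int[left.length + right.length][];
    int l = 0;
    int r = 0;
    int o = 0;
    while (l < left.length && r < right.length) {
      // "<=" takes the row from the left (earlier) batch on ties,
      // so rows with equal keys keep their incoming relative order.
      if (left[l][0] <= right[r][0]) {
        out[o++] = left[l++];
      } else {
        out[o++] = right[r++];
      }
    }
    while (l < left.length) {
      out[o++] = left[l++];
    }
    while (r < right.length) {
      out[o++] = right[r++];
    }
    return out;
  }

  /** Renders rows as "position_id,employee_id" pairs separated by spaces. */
  static String format(int[][] rows) {
    StringBuilder sb = new StringBuilder();
    for (int[] row : rows) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(row[0]).append(',').append(row[1]);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    int[][] first = {{1, 0}, {2, 1}};   // hypothetical earlier batch
    int[][] second = {{1, 2}, {2, 3}};  // hypothetical later batch
    System.out.println(format(mergeByPosition(first, second))); // 1,0 1,2 2,1 2,3
  }
}
```

Changing "<=" to "<" would still give a correctly sorted result, but rows from the later batch would then come out ahead of equal-keyed rows from the earlier one.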

Since neither SQL nor Drill makes guarantees about the stability of sort, we can say that the unstable sort is not a bug, just a fact of life. Altering the queries is the correct outcome in this case.

The new, managed external sort does away with the option to force small output batches (as doing so causes errors when using Union Vectors). So we will not preserve the implementation artifact in the managed sort, but will fix the tests to be correct instead.



> TestWindowFrame.testUnboundedFollowing relies on side effects
> -------------------------------------------------------------
>
>                 Key: DRILL-5142
>                 URL: https://issues.apache.org/jira/browse/DRILL-5142
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>
> The unit test {{TestWindowFrame.testUnboundedFollowing}} is one of a family of tests that run the same query in two different ways, using the results of the second to verify the first. Unfortunately, this particular test "works" only because it relies on undefined implementation artifacts of the way the "verification" query is run in Drill.
> Here is the query under test:
> {code}
> SELECT 
>   position_id,
>   employee_id,
>   LAST_VALUE(employee_id)
>     OVER(PARTITION BY position_id
>          ORDER BY employee_id
>          RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS `last_value`
> FROM
>   dfs_test.`%s/window/b4.p4`
> {code}
> With expected results as follows:
> {code}
> 1,0,9
> ...
> 1,9,9
> {code}
> Here is the "expected results" query:
> {code}
> SELECT
>   position_id,
>   employee_id,
>   MAX(employee_id) OVER(PARTITION BY position_id) AS `last_value`
> FROM (
>   SELECT *
>   FROM dfs_test.`%s/window/b4.p4`
>   ORDER BY position_id, employee_id
> )
> {code}
> The above happens to produce the correct results only because the query executes in a single fragment. The query produces correct results with the "unmanaged" external sort, but produces the following (valid) results with the managed external sort:
> {code}
> 1,0,9
> 1,2,9
> ...
> 1,9,9
> 1,1,9
> {code}
> The query relies on the inner query sort order "showing through" to the outer query. But, if the query were distributed, the outer query would be unordered. Hence, the verification query just happened to work, but is not actually valid.
> The proper solution is to modify the verification query to move the ORDER BY to the outer query:
> {code}
> ...
> FROM (
>   SELECT *
>   FROM dfs_test.`%s/window/b4.p4`
> )
>   ORDER BY position_id, employee_id
> {code}


