Posted to dev@hive.apache.org by "Jeff Hammerbacher (JIRA)" <ji...@apache.org> on 2008/11/13 02:27:44 UTC

[jira] Created: (HIVE-61) Implment ORDER BY

Implment ORDER BY
-----------------

                 Key: HIVE-61
                 URL: https://issues.apache.org/jira/browse/HIVE-61
             Project: Hadoop Hive
          Issue Type: New Feature
            Reporter: Jeff Hammerbacher


ORDER BY is in the query language reference but currently is a no-op. We should make it an op.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-61) Implment ORDER BY

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-61:
---------------------------

    Status: Patch Available  (was: Open)

Incorporated Zheng's comments.

> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>            Assignee: Zheng Shao
>         Attachments: hive.61.1.patch, hove.61.2.patch
>
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Commented: (HIVE-61) Implment ORDER BY

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700998#action_12700998 ] 

Zheng Shao commented on HIVE-61:
--------------------------------

Nit: Can you move the definition of numReducers inside the "if (qbp.getClusterByForClause(dest) != null ... "? In case the "if" is not executed, the variables don't get used, so they should not be exposed outside the "if"s.

There are 2 places where you assign a value to extraMRStep. I think it's cleaner to split it into 2 different variables, one defined and assigned in the "then" clause and one in the "else" clause of "if (qbp.getIsSubQ())". What do you think?


{code}
@@ -2944,21 +2958,36 @@
       curr = genSelectPlan(dest, qb, curr);
       Integer limit = qbp.getDestLimit(dest);
 
+      boolean extraMRStep = true;
+      int numReducers = -1;
+      if (qbp.getOrderByForClause(dest) != null) {
+        numReducers = 1;
+        extraMRStep = false;
+      }
+
       if (qbp.getClusterByForClause(dest) != null
           || qbp.getDistributeByForClause(dest) != null
+          || qbp.getOrderByForClause(dest) != null
           || qbp.getSortByForClause(dest) != null) {
-        curr = genReduceSinkPlan(dest, qb, curr, -1);
+        curr = genReduceSinkPlan(dest, qb, curr, numReducers);
       }
 
       if (qbp.getIsSubQ()) {
         if (limit != null) {
-          curr = genLimitMapRedPlan(dest, qb, curr, limit.intValue(), false);
+          curr = genLimitMapRedPlan(dest, qb, curr, limit.intValue(), extraMRStep);
         }
       } else {
         curr = genConversionOps(dest, qb, curr);
         // exact limit can be taken care of by the fetch operator
         if (limit != null) {
-          curr = genLimitMapRedPlan(dest, qb, curr, limit.intValue(), qb.getIsQuery());
+          if (qb.getIsQuery() &&
+              qbp.getClusterByForClause(dest) == null &&
+              qbp.getSortByForClause(dest) == null)
+            extraMRStep = false;
+          else
+            extraMRStep = true;
+
+          curr = genLimitMapRedPlan(dest, qb, curr, limit.intValue(), extraMRStep);
           qb.getParseInfo().setOuterQueryLimit(limit.intValue());
         }
         curr = genFileSinkPlan(dest, qb, curr);
{code}
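
One way to read the suggestion above, as a sketch rather than the committed code: compute each flag in its own branch, so that neither value is visible outside the branch that uses it. The class and method names below are hypothetical and not part of Hive; the boolean parameters stand in for the "qbp.get...ForClause(dest) != null" and "qb.getIsQuery()" checks in the patch.

```java
// Hypothetical helper, not part of Hive: models the decisions made in the patch
// as separate pure functions, one per branch of "if (qbp.getIsSubQ())".
final class LimitPlanSketch {
    // Subquery branch: in the patch, extraMRStep starts as true and is
    // cleared only when an ORDER BY clause is present.
    static boolean extraStepForSubquery(boolean hasOrderBy) {
        return !hasOrderBy;
    }

    // Top-level branch: the patch skips the extra step only for a query
    // that has neither CLUSTER BY nor SORT BY.
    static boolean extraStepForTopLevel(boolean isQuery, boolean hasClusterBy, boolean hasSortBy) {
        return !(isQuery && !hasClusterBy && !hasSortBy);
    }

    // Reducer count handed to genReduceSinkPlan in the patch:
    // 1 when ORDER BY is present, otherwise -1.
    static int numReducers(boolean hasOrderBy) {
        return hasOrderBy ? 1 : -1;
    }
}
```

With this shape, each call site would declare its own local (for example, one inside the "then" clause and one inside the "else" clause) instead of sharing a single mutable extraMRStep.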


> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>            Assignee: Zheng Shao
>         Attachments: hive.61.1.patch
>
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Commented: (HIVE-61) Implment ORDER BY

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647226#action_12647226 ] 

Zheng Shao commented on HIVE-61:
--------------------------------

We recently added the "SORT BY" clause, which sorts the data within each reducer. An example query is:
insert overwrite table table2 select city, state from table where city = 'Chicago' sort by state;

If you set the number of reducers to 1, then "sort by" will give the same result as "order by". (Do trim down the data size first; otherwise it will be very slow.)


"ORDER BY" is not supported yet, but we plan to support it shortly. The implementation we have in mind is based on sort by: we run the query with sort by, and then mark the table as sorted on these columns in the table metadata.
Then we will be able to "merge" the sorted files from each reducer and produce a total order.
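
The "merge" step described above is a standard k-way merge of already-sorted runs. A minimal sketch follows, assuming each reducer's output fits in an in-memory list; the class name SortedRunMerger is hypothetical and this is not Hive code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration: merges runs that are each already sorted
// (as SORT BY leaves each reducer's output) into one totally ordered list.
final class SortedRunMerger {
    // The current smallest unread value of one run, plus the rest of that run.
    private static final class Head<T> {
        final T value;
        final Iterator<T> rest;

        Head(T value, Iterator<T> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    static <T> List<T> merge(List<List<T>> runs, Comparator<? super T> order) {
        Comparator<Head<T>> byValue = (a, b) -> order.compare(a.value, b.value);
        PriorityQueue<Head<T>> heap = new PriorityQueue<>(byValue);
        // Seed the heap with the first element of every non-empty run.
        for (List<T> run : runs) {
            Iterator<T> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head<>(it.next(), it));
            }
        }
        List<T> out = new ArrayList<>();
        // Repeatedly emit the smallest head and refill from the run it came from.
        while (!heap.isEmpty()) {
            Head<T> smallest = heap.poll();
            out.add(smallest.value);
            if (smallest.rest.hasNext()) {
                heap.add(new Head<>(smallest.rest.next(), smallest.rest));
            }
        }
        return out;
    }
}
```

The merge reads each row once and keeps only one row per run in the heap, which is why it is attractive compared with re-sorting everything in a single reducer.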


> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Jeff Hammerbacher
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Updated: (HIVE-61) Implment ORDER BY

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-61:
-------------------------------

    Fix Version/s:     (was: 0.3.1)

> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>            Assignee: Zheng Shao
>             Fix For: 0.4.0
>
>         Attachments: hive.61.1.patch, hove.61.2.patch
>
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Commented: (HIVE-61) Implment ORDER BY

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647484#action_12647484 ] 

Zheng Shao commented on HIVE-61:
--------------------------------

Some more details about the implementation of ORDER BY:

We will store the columns on which a table (or a partition) is partitioned and sorted (in asc/desc order) right after we insert data into that table. We will also store whether the table is ordered or not.

If a table/partition is ordered and if we do a select (that outputs the data to console) from that table, then we will need to merge-sort all files in that table/partition based on the sort order.
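
To make the idea of stored sort metadata concrete, here is a sketch of how a list of (column, asc/desc) entries could be turned into the row comparator that a merge-sort over the files would need. Everything here is an assumption for illustration: the SortColumn type, modelling rows as String arrays, and comparing values as strings are not taken from Hive.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of per-table sort metadata and the comparator derived from it.
final class SortSpecSketch {
    // One metadata entry: the column a table is sorted on and the direction.
    static final class SortColumn {
        final int index;
        final boolean ascending;

        SortColumn(int index, boolean ascending) {
            this.index = index;
            this.ascending = ascending;
        }
    }

    // Builds a comparator that applies the sort columns in the order they are listed.
    // Rows are String arrays compared lexicographically, purely to keep the sketch small.
    static Comparator<String[]> rowComparator(List<SortColumn> spec) {
        Comparator<String[]> result = (a, b) -> 0;
        for (SortColumn col : spec) {
            Comparator<String[]> byCol = Comparator.comparing((String[] row) -> row[col.index]);
            result = result.thenComparing(col.ascending ? byCol : byCol.reversed());
        }
        return result;
    }
}
```

For example, a spec of (column 0 descending, column 1 ascending) would correspond to a table written with "SORT BY col0 DESC, col1 ASC".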



> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Jeff Hammerbacher
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Commented: (HIVE-61) Implment ORDER BY

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689683#action_12689683 ] 

Zheng Shao commented on HIVE-61:
--------------------------------

Most use cases for total ordering are to get the top 10.

For getting the top 10, the current work-around is:

First store the top 10 from each partition to some temp table:
INSERT OVERWRITE TABLE tableB
REDUCE a.*
USING 'head -n 10'
AS (col1, col2, col3, col4, ...)
FROM (SELECT * FROM tableA SORT BY col3 DESC, col4 ASC) a

Second, set the number of reducers to 1 and get the top 10 globally:
set mapred.reduce.tasks=1;
SELECT * FROM tableB SORT BY col3 DESC, col4 ASC LIMIT 10;
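
The two-step workaround is correct because any row in the global top 10 must also be in the top 10 of whichever reducer produced it. A small sketch of that reasoning follows, with in-memory lists standing in for each reducer's output; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of "top k per reducer, then top k overall".
final class TopKSketch {
    // First k rows under the given order: the analogue of SORT BY followed by 'head -n k'.
    static <T> List<T> topK(List<T> rows, Comparator<? super T> order, int k) {
        List<T> sorted = new ArrayList<>(rows);
        sorted.sort(order);
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    // Step 1: keep the top k of each partition (the temp table).
    // Step 2: take the top k of the combined candidates (the single-reducer query).
    static <T> List<T> globalTopK(List<List<T>> partitions, Comparator<? super T> order, int k) {
        List<T> candidates = new ArrayList<>();
        for (List<T> part : partitions) {
            candidates.addAll(topK(part, order, k));
        }
        return topK(candidates, order, k);
    }
}
```

The second step sorts at most k rows per reducer, so running it with a single reducer stays cheap even when the source table is large.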


> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>            Assignee: Zheng Shao
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Commented: (HIVE-61) Implment ORDER BY

Posted by "Adam Kramer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700525#action_12700525 ] 

Adam Kramer commented on HIVE-61:
---------------------------------

While this issue is open, it would be lovely to have Hive throw a syntax error when a user asks it to ORDER BY...lots of people are using it and being unhappy/confused when it fails.


> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>            Assignee: Zheng Shao
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Updated: (HIVE-61) Implment ORDER BY

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-61:
------------------------------

    Component/s: Query Processor

> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Updated: (HIVE-61) Implment ORDER BY

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-61:
---------------------------

       Resolution: Fixed
    Fix Version/s: 0.4.0
                   0.3.1
     Release Note: HIVE-61. Implement Group-By. (Namit Jain via zshao)
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

Committed to both branch-0.3 and trunk. Thanks Namit!

> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>            Assignee: Zheng Shao
>             Fix For: 0.3.1, 0.4.0
>
>         Attachments: hive.61.1.patch, hove.61.2.patch
>
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Assigned: (HIVE-61) Implment ORDER BY

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao reassigned HIVE-61:
------------------------------

    Assignee: Zheng Shao

> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>            Assignee: Zheng Shao
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Updated: (HIVE-61) Implment ORDER BY

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-61:
---------------------------

    Attachment: hove.61.2.patch

> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>            Assignee: Zheng Shao
>         Attachments: hive.61.1.patch, hove.61.2.patch
>
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Updated: (HIVE-61) Implment ORDER BY

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-61:
---------------------------

    Status: Open  (was: Patch Available)

> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>            Assignee: Zheng Shao
>         Attachments: hive.61.1.patch, hove.61.2.patch
>
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Commented: (HIVE-61) Implment ORDER BY

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700712#action_12700712 ] 

Jeff Hammerbacher commented on HIVE-61:
---------------------------------------

+1 to a verbose comment when the user attempts an ORDER BY.

> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>            Assignee: Zheng Shao
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Updated: (HIVE-61) Implment ORDER BY

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-61:
---------------------------

    Status: Patch Available  (was: Open)

> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>            Assignee: Zheng Shao
>         Attachments: hive.61.1.patch
>
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.


[jira] Updated: (HIVE-61) Implment ORDER BY

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-61:
---------------------------

    Attachment: hive.61.1.patch

> Implment ORDER BY
> -----------------
>
>                 Key: HIVE-61
>                 URL: https://issues.apache.org/jira/browse/HIVE-61
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Jeff Hammerbacher
>            Assignee: Zheng Shao
>         Attachments: hive.61.1.patch
>
>
> ORDER BY is in the query language reference but currently is a no-op. We should make it an op.
