Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2009/01/27 02:52:59 UTC

[jira] Created: (HIVE-252) Automatically add CLUSTER BY and set the number of reducers if the target table is declared with "CLUSTERED BY (xxx) INTO yyy BUCKETS"

Automatically add CLUSTER BY and set the number of reducers if the target table is declared with "CLUSTERED BY (xxx) INTO yyy BUCKETS"
--------------------------------------------------------------------------------------------------------------------------------------

                 Key: HIVE-252
                 URL: https://issues.apache.org/jira/browse/HIVE-252
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
            Reporter: Zheng Shao


We should automatically add a "CLUSTER BY" clause to the INSERT query below and run it with 64 reducers, to match the 64 buckets declared on the target table.

CREATE TABLE aaa (a BIGINT, b INT)
PARTITIONED BY(ds STRING)
CLUSTERED BY(a) INTO 64 BUCKETS 
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE aaa PARTITION(ds='2009-01-24')
SELECT a.a, a.b
FROM training_set a
WHERE a.ds = '2009-01-24';
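
For comparison, here is a minimal sketch of what a user would presumably have to write by hand today to get the same effect. It assumes the reducer count is set through the Hadoop property mapred.reduce.tasks and that CLUSTER BY on the bucketing column is enough to route rows to the right bucket; the table aliases are dropped only to keep the column reference unambiguous. Treat it as an illustration of the intended rewrite, not as the exact plan the compiler would produce.

```sql
-- Assumption: one reducer per declared bucket (64 BUCKETS -> 64 reducers).
SET mapred.reduce.tasks=64;

-- Same insert as above, with the CLUSTER BY added manually
-- on the column named in CLUSTERED BY(a).
INSERT OVERWRITE TABLE aaa PARTITION(ds='2009-01-24')
SELECT a, b
FROM training_set
WHERE ds = '2009-01-24'
CLUSTER BY a;
```

The proposal is for the query processor to derive both the CLUSTER BY column and the reducer count from the table declaration, so that neither has to be repeated in the query.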


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-252) Automatically add CLUSTER BY and set the number of reducers if the target table is declared with "CLUSTERED BY (xxx) INTO yyy BUCKETS"

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667571#action_12667571 ] 

Joydeep Sen Sarma commented on HIVE-252:
----------------------------------------

If we are going down this path, it should be generalized: an insert into any clustered table would have to check that the data is clustered by the right key before the insertion.

The query rewrite may not be obvious in every case where a violation of that check is detected. It would be a more consistent user experience to 'suggest' the right query where possible.

There is also the flip side to this: setting the clustering property on the inserted table/partition when the query is already constructed in this manner (what Adam was asking for a couple of days back).


[jira] Commented: (HIVE-252) Automatically add CLUSTER BY and set the number of reducers if the target table is declared with "CLUSTERED BY (xxx) INTO yyy BUCKETS"

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667796#action_12667796 ] 

Ashish Thusoo commented on HIVE-252:
------------------------------------

I think what Zheng is suggesting is that if the DDL declares a clustering, then we introduce the plan fragment for CLUSTER BY before the FileSinkOperator.

I guess we could eliminate any unneeded cluster bys from the operator plan using the tree walker, so we would not incur the cost of another clustering if the data is already being clustered.


[jira] Commented: (HIVE-252) Automatically add CLUSTER BY and set the number of reducers if the target table is declared with "CLUSTERED BY (xxx) INTO yyy BUCKETS"

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667954#action_12667954 ] 

Joydeep Sen Sarma commented on HIVE-252:
----------------------------------------

Suppose I have

... cluster by a.x insert overwrite table T select a.x+1,...;

where T is declared clustered on its first column. Clearly the query does not require any modification, but it will be hard to detect this in the compiler. Or am I missing something?

Did someone request this? I am a little curious, since I think that declaring a table to be clustered would typically be done by an advanced user, and such users can write the correct query without a lot of compiler smarts. If, on the other hand, we want to help out the average user, we would be much better served by inferring the clustering property of the target table/partition from the query and storing it automatically, so that we can leverage it for future plans.
