Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2012/07/16 19:21:35 UTC

[jira] [Commented] (HIVE-3244) Add table property which constraints sorting/bucketing for data loading

    [ https://issues.apache.org/jira/browse/HIVE-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415406#comment-13415406 ] 

Namit Jain commented on HIVE-3244:
----------------------------------

some initial comments
                
> Add table property which constraints sorting/bucketing for data loading
> -----------------------------------------------------------------------
>
>                 Key: HIVE-3244
>                 URL: https://issues.apache.org/jira/browse/HIVE-3244
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.10.0
>         Environment: ubuntu 10.10
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>
> This ticket is intended to implement "INSERT INTO" for bucketed tables.
> With the hive.enforce.bucketing option, a user can append data to a bucketed table. However, the current implementation relies on the lexical order of file names to determine each file's bucket number, and that order does not always match the actual bucket.
> So if the bucket number is appended to the file name as a suffix when inserting (moving) files, it can be obtained correctly wherever it is needed, such as in BucketMapJoinOptimizer.
> With simple prototype code, which will be attached after writing this, the test query
> {noformat}
> create table bucket_test (key int, value string) clustered by (key) sorted by (key) into 4 buckets TBLPROPERTIES
> ('FORCEDBUCKETING'='TRUE', 'FORCEDSORTING'='TRUE');
> set hive.optimize.bucketmapjoin = true;
> insert into table bucket_test select key, value from src1;
> explain extended select /*+MAPJOIN(b)*/ * from bucket_test a join bucket_test b on a.key=b.key;
> insert into table bucket_test select key, value from src1;
> explain extended select /*+MAPJOIN(b)*/ * from bucket_test a join bucket_test b on a.key=b.key;
> {noformat}
> produced the results below
> {noformat}
> 1. first plan
>  b {000000_0_[0]=[000000_0_[0]], 000001_0_[1]=[000001_0_[1]], 000002_0_[2]=[000002_0_[2]], 000003_0_[3]=[000003_0_[3]]}
> 2. second plan
>  b {000000_0_[0]=[000000_0_[0], 000000_0_copy_1_[0]], 000000_0_copy_1_[0]=[000000_0_[0], 000000_0_copy_1_[0]], 000001_0_[1]=[000001_0_[1], 000001_0_copy_1_[1]], 000001_0_copy_1_[1]=[000001_0_[1], 000001_0_copy_1_[1]], 000002_0_[2]=[000002_0_[2], 000002_0_copy_1_[2]], 000002_0_copy_1_[2]=[000002_0_[2], 000002_0_copy_1_[2]], 000003_0_[3]=[000003_0_[3], 000003_0_copy_1_[3]], 000003_0_copy_1_[3]=[000003_0_[3], 000003_0_copy_1_[3]]}
> {noformat}
> Currently, I've prevented direct loading via 'LOAD DATA' into forced-bucket tables, but with proper file name validation that could be allowed.
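
To illustrate the problem described in the quoted description, here is a minimal Python sketch (not the attached prototype and not Hive's code) contrasting position-based bucket assignment with reading the bucket number from the file name. It assumes the naming pattern visible in the plan output above, <task id>_<attempt>[_copy_<n>], and that the leading task id coincides with the bucket number, as it does in that output. The function names and the regular expression are hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming pattern, inferred from the plan output above:
# <task id>_<attempt>[_copy_<n>], e.g. "000001_0" or "000001_0_copy_1".
_NAME = re.compile(r"^(\d+)_(\d+)(?:_copy_(\d+))?$")


def bucket_from_name(file_name):
    """Read the bucket number from the file name (hypothetical helper)."""
    match = _NAME.match(file_name)
    if match is None:
        # A loader could reject such a file instead of guessing its bucket.
        raise ValueError("cannot determine bucket for %r" % file_name)
    return int(match.group(1))


def bucket_by_position(file_names):
    """Assign buckets by position in lexical order of the file names."""
    return {name: i for i, name in enumerate(sorted(file_names))}


def group_by_bucket(file_names):
    """Group files by the bucket number parsed from each name."""
    groups = defaultdict(list)
    for name in sorted(file_names):
        groups[bucket_from_name(name)].append(name)
    return dict(groups)


# Files as they might look after two inserts into a 2-bucket table.
files = ["000000_0", "000001_0", "000000_0_copy_1", "000001_0_copy_1"]

# Position-based assignment places the appended copy of bucket 0 in "bucket 1".
print(bucket_by_position(files)["000000_0_copy_1"])
# Name-based assignment keeps each copy with its original bucket.
print(group_by_bucket(files))
```

Under these assumptions, the ValueError path is also where a file name check for 'LOAD DATA' could go: a file whose name does not identify a bucket would be refused instead of being assigned one by position.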
