Posted to issues@carbondata.apache.org by "Cao, Lionel (JIRA)" <ji...@apache.org> on 2017/04/12 07:13:41 UTC
[jira] [Commented] (CARBONDATA-910) Implement Partition feature

    [ https://issues.apache.org/jira/browse/CARBONDATA-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965484#comment-15965484 ] 

Cao, Lionel commented on CARBONDATA-910:
----------------------------------------

2017-04-12 notes:
1. List partitioning should support value groups, e.g. partition by list area((China, India), (England, France), (America, Canada)).
2. Support adding and deleting partitions, and maybe rebuild in the future; deleting a partition will also delete its data.
3. For data storage, option 2 is preferred, with partitionId used as taskId.
4. Single-level partitioning only for the first version; no composite partitioning.
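
To illustrate note 1, here is a minimal Python sketch of how value groups could map column values to partition ids. The function names are hypothetical and not part of CarbonData. Rejecting duplicate values and returning None for unlisted values are assumptions, since the notes do not say how those cases are handled.

```python
# Hypothetical sketch: names below are illustrative, not CarbonData APIs.

def build_list_partition_index(groups):
    """Map each listed value to the id of the value group that contains it."""
    index = {}
    for partition_id, group in enumerate(groups):
        for value in group:
            # Assumption: a value may belong to only one group.
            if value in index:
                raise ValueError(f"value {value!r} appears in more than one group")
            index[value] = partition_id
    return index


def partition_for(index, value):
    """Return the partition id for a value, or None if the value is not listed."""
    return index.get(value)


# The value groups from note 1: each group becomes one partition.
groups = [("China", "India"), ("England", "France"), ("America", "Canada")]
area_index = build_list_partition_index(groups)

print(partition_for(area_index, "India"))   # 0
print(partition_for(area_index, "France"))  # 1
print(partition_for(area_index, "Brazil"))  # None (not in any group)
```

With this layout, the returned id could be the partitionId that note 3 proposes to reuse as taskId.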

> Implement Partition feature
> ---------------------------
>
>                 Key: CARBONDATA-910
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-910
>             Project: CarbonData
>          Issue Type: New Feature
>          Components: core, data-load, data-query
>            Reporter: Cao, Lionel
>            Assignee: Cao, Lionel
>
> Why we need partitioned tables
> A partitioned table provides an option to divide a table into smaller pieces.
> With a partitioned table:
>       1. Data can be better managed, organized and stored.
>       2. We can avoid a full table scan in some scenarios and improve query performance (for example, when the partition column appears in a filter, or when multiple partitioned tables are joined on the same partition column).
> Partitioning design
> Range Partitioning
>        Range partitioning maps data to partitions according to ranges of partition column values; the operator '<' defines the non-inclusive upper bound of the current partition.
> List Partitioning
>        List partitioning maps data to partitions according to a specified list of values.
> Hash Partitioning
>        Hash partitioning maps data to partitions using a hash algorithm and distributes it across the given number of partitions.
> Composite Partitioning(2 levels at most for now)
>        Range-Range, Range-List, Range-Hash, List-Range, List-List, List-Hash, Hash-Range, Hash-List, Hash-Hash
> DDL-Create 
> Create table sales(
>      itemid long, 
>      logdate datetime, 
>      customerid int
>      ...
>      ...)
> [partition by range logdate(...)]
> [subpartition by list area(...)]
> Stored By 'carbondata'
> [tblproperties(...)];
> range partition: 
>      partition by range logdate(<  '2016-01-01', < '2017-01-01', < '2017-02-01', < '2017-03-01', < '2099-01-01')
> list partition:
>      partition by list area('Asia', 'Europe', 'North America', 'Africa', 'Oceania')
> hash partition:
>      partition by hash(itemid, 9) 
> composite partition:
>      partition by range logdate(<  '2016-01-01', < '2017-01-01', < '2017-02-01', < '2017-03-01', < '2099-01-01')
>      subpartition by list area('Asia', 'Europe', 'North America', 'Africa', 'Oceania')
> DDL-Rebuild, Add
> Alter table sales rebuild partition by (range|list|hash)(...);
> Alter table sales add partition (< '2018-01-01');    #only range partitioning and list partitioning are supported
> Alter table sales add partition ('South America');
> #Note: There is no delete operation for partitions; please use rebuild.
> If data needs to be deleted, use a delete statement, but the partition definition will not be deleted.
> Partition Table Data Store
> [Option One]
> Use the current design and keep the partition folders outside of the segments.
> Fact
>    |___Part0
>    |          |___Segment_0
>    |                         |___ *******-[bucketId]-.carbondata
>    |                         |___ *******-[bucketId]-.carbondata
>    |          |___Segment_1
>    |          ...
>    |___Part1
>    |          |___Segment_0
>    |          |___Segment_1
>    |...
> [Option Two]
> Remove the partition folders, add the partition id to the file name, and build the B-tree on the driver side.
> Fact
>    |___Segment_0
>    |                  |___ *******-[bucketId]-[partitionId].carbondata
>    |                  |___ *******-[bucketId]-[partitionId].carbondata
>    |___Segment_1
>    |___Segment_2
>    ...
> Pros & Cons: 
> Option one would be faster at locating target files.
> Option two needs to store more folder metadata.
> Partition Table MetaData Store
> Partition info should be stored in the file footer/index file and loaded into memory before user queries run.
> Relationship with Bucket
> Buckets should sit at a lower level than partitions.
> Partition Table Query
> Example:
> Select * from sales
> where logdate <= date '2016-12-01';
> Users should remember to add a partition filter when writing SQL against a partitioned table.
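
As a rough illustration of why the partition filter matters, the Python sketch below works out which range partitions a filter such as logdate <= '2016-12-01' would need to scan. It uses the upper bounds from the range partition example in the quoted description. The function names are hypothetical and this is not CarbonData's pruning logic. It assumes each bound is a non-inclusive upper bound, as the description states, and that values at or beyond the last bound belong to no partition.

```python
from bisect import bisect_right
from datetime import date

# Non-inclusive upper bounds taken from the range partition example.
BOUNDS = [
    date(2016, 1, 1),
    date(2017, 1, 1),
    date(2017, 2, 1),
    date(2017, 3, 1),
    date(2099, 1, 1),
]


def range_partition_id(bounds, value):
    """Partition i holds values v with bounds[i-1] <= v < bounds[i]."""
    idx = bisect_right(bounds, value)
    if idx == len(bounds):
        # Assumption: a value at or beyond the last bound has no partition.
        raise ValueError("value is not below any upper bound")
    return idx


def partitions_for_less_equal(bounds, value):
    """Ids of partitions that may hold rows satisfying `column <= value`."""
    last = min(bisect_right(bounds, value), len(bounds) - 1)
    return list(range(last + 1))


# The filter from the example query: logdate <= date '2016-12-01'
print(partitions_for_less_equal(BOUNDS, date(2016, 12, 1)))  # [0, 1]
```

Under these assumptions, only the first two of the five partitions need to be read for the example query; without a filter on logdate, every partition would be scanned.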



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)