Posted to dev@carbondata.apache.org by Lu Cao <wh...@gmail.com> on 2017/04/13 08:05:55 UTC

[Discussion] Implement Partition Table Feature

Hi Dev,
I've drafted a doc about implementing the partition table feature. Please help
review it and share your advice.

https://github.com/lionelcao/CarbonData_Docs/blob/master/partition.md

Thanks,
Cao Lu

Re: [Discussion] Implement Partition Table Feature

Posted by Lu Cao <wh...@gmail.com>.
1. Carbon uses different SQL parsers for Spark 1.6 and 2.1, so CarbonSQLParser
also needs to be changed for 1.6.
2. For interval range partitions, no fixed partition name is defined in the DDL,
but the partition names still need to be kept in the schema and updated when a
new partition is added.
3. On the driver side, there is one BTree for each partition in each segment.
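
To illustrate point 2, here is a rough sketch of how generated partition names
could be kept in the schema and extended when a new interval is first seen. The
class name, the fields, and the "p_<lower>_<upper>" naming pattern are all
assumptions made for this example, not part of the design doc.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class IntervalPartitionSchema:
    """Hypothetical schema entry for an interval range partition."""
    start: int                      # lower bound of the first interval
    width: int                      # fixed interval width
    names: Dict[int, str] = field(default_factory=dict)  # interval index -> name

    def partition_for(self, value: int) -> str:
        """Return the partition name for a value, registering it if new."""
        if value < self.start:
            raise ValueError("value is below the first interval")
        idx = (value - self.start) // self.width
        if idx not in self.names:
            # No name comes from the DDL, so generate one and store it in
            # the schema (the naming pattern is an assumption).
            lower = self.start + idx * self.width
            self.names[idx] = f"p_{lower}_{lower + self.width}"
        return self.names[idx]


schema = IntervalPartitionSchema(start=0, width=100)
first = schema.partition_for(250)    # creates and stores a new partition name
second = schema.partition_for(299)   # same interval, so the stored name is reused
```

In this sketch the schema only grows when a value lands in an interval that has
no name yet, which is the "update when new partition is added" step.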

On Mon, Apr 17, 2017 at 3:29 PM, QiangCai <qi...@qq.com> wrote:

> Sub-task list for the Partition Table feature:
>
> 1. Define the PartitionInfo model
> Modify schema.thrift to define PartitionInfo, and add PartitionInfo to
> TableSchema.
>
> 2. Create table with partition
> CarbonSparkSqlParser parses the partition clause to generate PartitionInfo
> and adds PartitionInfo to TableModel.
>
> CreateTable adds PartitionInfo to TableInfo and stores PartitionInfo in
> TableSchema.
>
> 3. Data loading of a partition table
> Use PartitionInfo to generate a Partitioner (hash, list, range).
> Use the Partitioner to repartition the input data file, then reuse the
> loadDataFrame flow.
> Use the partition id to replace the task no in the carbondata/index file
> name.
>
> 4. Detail filter query on the partition column
> Support the equal filter first: use it to get the partition id, then use
> that partition id to filter the BTree.
> Other filters (range, in, ...) will be supported in the future.
>
> 5. Join of partition tables on the partition column
>
> 6. Alter table add/drop partition
>
> Any suggestions?
>
> Best Regards,
> David QiangCai

Re: [Discussion] Implement Partition Table Feature

Posted by QiangCai <qi...@qq.com>.
Sub-task list for the Partition Table feature:

1. Define the PartitionInfo model
Modify schema.thrift to define PartitionInfo, and add PartitionInfo to
TableSchema.

2. Create table with partition
CarbonSparkSqlParser parses the partition clause to generate PartitionInfo
and adds PartitionInfo to TableModel.

CreateTable adds PartitionInfo to TableInfo and stores PartitionInfo in
TableSchema.

3. Data loading of a partition table
Use PartitionInfo to generate a Partitioner (hash, list, range).
Use the Partitioner to repartition the input data file, then reuse the
loadDataFrame flow.
Use the partition id to replace the task no in the carbondata/index file name.

4. Detail filter query on the partition column
Support the equal filter first: use it to get the partition id, then use that
partition id to filter the BTree.
Other filters (range, in, ...) will be supported in the future.

5. Join of partition tables on the partition column

6. Alter table add/drop partition
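
As a rough illustration of sub-task 3, the sketch below derives a partition id
from a PartitionInfo-like object for the hash, list, and range cases, then
groups input rows by that id. It is plain Python rather than the Spark flow, and
the field names, the catch-all partition for unlisted values, and the
exclusive-upper-bound rule for ranges are assumptions made for the example.

```python
import bisect
import zlib
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PartitionInfo:
    """Hypothetical stand-in for the PartitionInfo stored in TableSchema."""
    column: str                      # partition column name
    kind: str                        # "hash", "list" or "range"
    num_partitions: int = 0          # hash: number of partitions
    list_groups: List[List[Any]] = field(default_factory=list)  # list: values per partition
    range_bounds: List[Any] = field(default_factory=list)       # range: sorted upper bounds


def partition_id(info: PartitionInfo, value: Any) -> int:
    """Map one partition-column value to a partition id."""
    if info.kind == "hash":
        # crc32 is stable across processes, unlike Python's built-in hash() on str.
        return zlib.crc32(str(value).encode("utf-8")) % info.num_partitions
    if info.kind == "list":
        for pid, group in enumerate(info.list_groups):
            if value in group:
                return pid
        # Assumption: unlisted values go to one extra catch-all partition.
        return len(info.list_groups)
    if info.kind == "range":
        # Each bound is an exclusive upper bound; values at or above the
        # last bound fall into one extra trailing partition.
        return bisect.bisect_right(info.range_bounds, value)
    raise ValueError(f"unknown partition kind: {info.kind}")


def repartition(info: PartitionInfo, rows: List[Dict[str, Any]]) -> Dict[int, List[Dict[str, Any]]]:
    """Group input rows by partition id, mimicking the repartition step."""
    buckets: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        buckets[partition_id(info, row[info.column])].append(row)
    return dict(buckets)


range_info = PartitionInfo(column="age", kind="range", range_bounds=[10, 20])
rows = [{"age": 5}, {"age": 10}, {"age": 25}, {"age": 19}]
by_partition = repartition(range_info, rows)
```

Each bucket in `by_partition` would then go through the existing load flow, with
the bucket's key used where the task no appears in the file name today.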

Any suggestions?

Best Regards,
David QiangCai



--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-Implement-Partition-Table-Feature-tp10938p11151.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.

Re: [Discussion] Implement Partition Table Feature

Posted by QiangCai <qi...@qq.com>.
Hi Cao Lu,
  I suggest mentioning the following information.

1. Table creation
Modify schema.thrift and add optional partitioner information to TableSchema.

2. Alter table add/drop partition

3. Data loading of a partition table
Use the partitioner information in TableSchema to generate the table
partitioner, then use this partitioner to repartition the input RDD, and
finally reuse the loadDataFrame flow.

Use the partition id to replace the task no in the carbondata/index file name,
so there is no need to store partition information in the footer and index
file.

4. Detail query on a partition table with a partition column filter
Use the partition column filter to get the partition id list, then use the
partition id list to filter the BTree.

5. Join of partition tables on the partition column
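
Point 4 can be pictured with the small sketch below: a filter on the partition
column is turned into a list of partition ids, and only the driver-side BTrees
for those ids are kept. The index layout (one entry per segment and partition),
the hash partitioning with 4 partitions, and all names are assumptions for the
example, with each BTree reduced to a list of block names.

```python
import zlib
from typing import Dict, Iterable, List, Tuple

NUM_PARTITIONS = 4  # assumed number of hash partitions

# Stand-in for the driver-side index: one "BTree" (here just a list of
# block names) per (segment id, partition id) pair.
DriverIndex = Dict[Tuple[str, int], List[str]]


def hash_partition_id(value: object, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash partitioning of a partition-column value."""
    return zlib.crc32(str(value).encode("utf-8")) % num_partitions


def partition_ids_for_filter(values: Iterable[object]) -> List[int]:
    """Partition ids hit by an equal filter (one value) or an IN filter (many)."""
    return sorted({hash_partition_id(v) for v in values})


def prune(index: DriverIndex, partition_ids: Iterable[int]) -> DriverIndex:
    """Keep only the BTrees whose partition id is in the given list."""
    wanted = set(partition_ids)
    return {key: tree for key, tree in index.items() if key[1] in wanted}


index: DriverIndex = {
    (segment, pid): [f"segment{segment}-partition{pid}-block0"]
    for segment in ("0", "1")
    for pid in range(NUM_PARTITIONS)
}
ids = partition_ids_for_filter(["beijing"])   # e.g. WHERE city = 'beijing'
pruned = prune(index, ids)                    # one BTree left per segment
```

With an equal filter, only one partition per segment survives pruning, so the
remaining filters run against a fraction of the index.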





Re: [Discussion] Implement Partition Table Feature

Posted by Liang Chen <ch...@gmail.com>.
Hi,

Agreed, we don't need to add a special constraint that prevents a column from
being in both the partition key and SORT_COLUMNS.

But based on practical cases, I don't suggest making the partition key the same
as the first column of SORT_COLUMNS. Maybe we should add this tip to the
partition feature's documentation.

Regards
Liang



Re: [Discussion] Implement Partition Table Feature

Posted by Jacky Li <ja...@qq.com>.
> On April 15, 2017, at 12:00 PM, Jacky Li <ja...@qq.com> wrote:
> 
> Hi Cao Lu,
> 
> The overall design looks good to me. I just have the following points to
> confirm:
> 1. Is there a delete partition DDL?
> 2. For the data loading part, does it need to do a global shuffle before the
> actual data loading? And the partition key should not be included in the
> SORT_COLUMNS option, right? If so, I think it is better to put this
> constraint in the document as well.

On second thought, I think it is up to the user whether to put the partition key in SORT_COLUMNS. There should be no constraint.

> 3. For the query part, I suggest adding more description of the index, such
> as how the B-tree will be loaded into the driver and how many B-trees there
> will be.
> 4. As a further optimization, is it possible that we map the partition to
> DataNode such that we do not need to communicate with NameNode for every
> query? Can this mapping be considered like a cache?
> 
> Regards,
> Jacky




Re: [Discussion] Implement Partition Table Feature

Posted by Jacky Li <ja...@qq.com>.
Hi Cao Lu,

The overall design looks good to me. I just have the following points to
confirm:
1. Is there a delete partition DDL?
2. For the data loading part, does it need to do a global shuffle before the
actual data loading? And the partition key should not be included in the
SORT_COLUMNS option, right? If so, I think it is better to put this constraint
in the document as well.
3. For the query part, I suggest adding more description of the index, such as
how the B-tree will be loaded into the driver and how many B-trees there will
be.
4. As a further optimization, is it possible that we map the partition to
DataNode such that we do not need to communicate with NameNode for every
query? Can this mapping be considered like a cache?

Regards,
Jacky


