
Posted to dev@carbondata.apache.org by weijie tong <to...@gmail.com> on 2016/10/21 08:01:10 UTC

questions about carbondata

1. What is the relationship between these terms:
carbondata file, block, blocklet, and carbondata file footer? When a
batch job loads data into a carbondata table, does that mean the table
is composed of different blocks, where each block is a carbondata
file made up of many blocklets and one FileFooter, according to
the carbondata file format?

2. How is column data stored as an inverted index?
What is the dimension column data inverted to? How does the inverted index
affect a query?

3. Are all the blocklets stored in sequence according to the sorted MDK key?

I hope someone can give a detailed answer.

Re: questions about carbondata

Posted by 杰 <25...@qq.com>.

Hi,

For 3: blocklets are not sorted globally, nor is the sort scoped to a single block.
More precisely, blocklets are sorted within a partition, and one partition has
many blocks. "Partition" here means exactly a Spark partition: Carbon does its
further processing in the Spark executor, so one Spark partition
holds many Carbon blocks. Carbon's MDK key is not sorted globally, but the Carbon dictionary is global,
so a global dictionary plus sorting within each partition should make Carbon not much different from HBase.
As for the index files: the carbondataindex file contains block-level index info, while the footer in each carbondata file contains blocklet-level index info.
That gives two levels: one for filtering in the driver and one for filtering in the executor.

Thanks
Jay
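
The two index levels described above can be sketched roughly as follows. This is only an illustration, not CarbonData's implementation: it assumes each index entry is a simple (min, max) key range, it uses plain integers in place of real MDK keys, and the class names, function names, and file names (`part-0-a` and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Blocklet:
    keys: List[int]  # stand-in for rows, identified by an integer "MDK"

    @property
    def key_range(self) -> Tuple[int, int]:
        return min(self.keys), max(self.keys)


@dataclass
class Block:  # one block corresponds to one carbondata file
    name: str
    blocklets: List[Blocklet]

    def footer(self) -> List[Tuple[int, int]]:
        """Blocklet-level index: one (min, max) entry per blocklet (assumed shape)."""
        return [b.key_range for b in self.blocklets]

    def index_entry(self) -> Tuple[int, int]:
        """Block-level index: one (min, max) entry for the whole file (assumed shape)."""
        ranges = self.footer()
        return min(lo for lo, _ in ranges), max(hi for _, hi in ranges)


def overlaps(key_range: Tuple[int, int], lo: int, hi: int) -> bool:
    return key_range[0] <= hi and lo <= key_range[1]


def driver_prune(blocks: List[Block], lo: int, hi: int) -> List[Block]:
    """Level 1: drop whole blocks using only the block-level index."""
    return [b for b in blocks if overlaps(b.index_entry(), lo, hi)]


def executor_scan(block: Block, lo: int, hi: int) -> List[int]:
    """Level 2: use the footer to skip blocklets, then scan the survivors."""
    hits: List[int] = []
    for blocklet, key_range in zip(block.blocklets, block.footer()):
        if overlaps(key_range, lo, hi):
            hits.extend(k for k in blocklet.keys if lo <= k <= hi)
    return hits


# Keys are sorted within each partition (part-0, part-1) but not across them.
blocks = [
    Block("part-0-a", [Blocklet([1, 3, 5]), Blocklet([6, 8, 9])]),
    Block("part-0-b", [Blocklet([20, 22]), Blocklet([25, 30])]),
    Block("part-1-a", [Blocklet([2, 4, 7]), Blocklet([21, 40])]),
]
candidates = driver_prune(blocks, 6, 8)
result = sorted(k for b in candidates for k in executor_scan(b, 6, 8))
```

In this toy data the key ranges of the two partitions overlap, so a range filter may have to visit blocks from more than one partition, which is the practical consequence of sorting per partition rather than globally.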

------------------ Original Message ------------------
From: "weijie tong";<to...@gmail.com>;
Sent: Saturday, October 22, 2016, 12:30 PM
To: "dev"<de...@carbondata.incubator.apache.org>;

Subject: Re: questions about carbondata

Thanks for the reply. For 3, I still want to know whether all the blocklets
of all the blocks are stored in sequence according to the sorted MDK key. If so,
the globally sorted MDK key of the carbon table would behave like an
HBase rowkey. Or is the sequence local to each block, with the block index file
managing the block-level index?


Re: questions about carbondata

Posted by weijie tong <to...@gmail.com>.

Thanks for the reply. For 3, I still want to know whether all the blocklets
of all the blocks are stored in sequence according to the sorted MDK key. If so,
the globally sorted MDK key of the carbon table would behave like an
HBase rowkey. Or is the sequence local to each block, with the block index file
managing the block-level index?

On Fri, Oct 21, 2016 at 5:48 PM, 杰 <25...@qq.com> wrote:

> Hi,
> 1. Correct.
>    One carbon file is the same as one block. One block has many blocklets as
> well as one file footer, which holds the metadata (B-tree index) of the blocklets.
>    One load creates one segment, and one segment has many blocks.
> 2. Carbon sorts the dimension column data within one blocklet, so the original
> row order is lost. Carbon therefore stores the row ids together with the
> dimension column data:
>    when the dimension column data is sorted, the row ids are reordered
> correspondingly, so the mapping (like an array: index => data) is kept.
>    At query time, carbon first finds the expected dimension column data (based
> on the filter), then uses the mapping to get the row ids.
>    Based on the row ids, we can then get the measure data.
>    This is why the column data is called an inverted index: it maps data =>
> index, not index => data.
> 3. Yes.

Re: questions about carbondata

Posted by 杰 <25...@qq.com>.

Hi,
1. Correct.
One carbon file is the same as one block. One block has many blocklets as well as one file footer, which holds the metadata (B-tree index) of the blocklets.
One load creates one segment, and one segment has many blocks.
2. Carbon sorts the dimension column data within one blocklet, so the original row order is lost. Carbon therefore stores the row ids together with the dimension column data:
when the dimension column data is sorted, the row ids are reordered correspondingly, so the mapping (like an array: index => data) is kept.
At query time, carbon first finds the expected dimension column data (based on the filter), then uses the mapping to get the row ids.
Based on the row ids, we can then get the measure data.
This is why the column data is called an inverted index: it maps data => index, not index => data.
3. Yes.
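
The inverted index described in point 2 can be illustrated with a small sketch. It is a simplification under stated assumptions, not CarbonData's actual storage code: the column values are plain strings rather than dictionary-encoded keys, and the function names and sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_inverted_index(dim_values):
    """Sort one blocklet's dimension column and keep the original row ids alongside."""
    # Pairs of (value, original row id), ordered by value.
    pairs = sorted((value, row_id) for row_id, value in enumerate(dim_values))
    sorted_values = [value for value, _ in pairs]
    row_ids = [row_id for _, row_id in pairs]
    return sorted_values, row_ids


def filter_equals(sorted_values, row_ids, wanted):
    """Binary-search the sorted column, then map the hits back to row ids."""
    lo = bisect_left(sorted_values, wanted)
    hi = bisect_right(sorted_values, wanted)
    return sorted(row_ids[lo:hi])


# Hypothetical blocklet: one dimension column and one measure column, in load order.
city = ["paris", "berlin", "paris", "oslo", "berlin"]
sales = [10, 20, 30, 40, 50]

sorted_city, city_row_ids = build_inverted_index(city)
matching_rows = filter_equals(sorted_city, city_row_ids, "paris")
paris_sales = [sales[r] for r in matching_rows]  # measures fetched by row id
```

Because the dimension values are sorted, the filter can use binary search instead of a full scan; the stored row ids are what make it possible to get back to the measure values of the matching rows.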
