You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@carbondata.apache.org by "bill.zhou" <zg...@163.com> on 2017/03/14 15:10:41 UTC

【DISCUSS】add more index for sort columns

hi all 

  Carbon will add min/max index for sort columns which used for better
filter query. So can we add more index for the sort column to make filter
faster. 

  This is one idea which I get from anther database design.
   For example this is one student, and the column: score in the student
table which will be sorted column. And the score range is from 1 to 100. 
The table as following: 

id	name	score
1 	bill001	83
2 	bill002	84
3 	bill003	90
4 	bill004	89
5 	bill005	93
6 	bill006	76
7 	bill007	87
8 	bill008	90
9 	bill009	89
10	bill010	96
11	bill011	96
12	bill012	100
13	bill013	84
14	bill014	90
15	bill015	79
16	bill016	1
17	bill017	97
18	bill018	79
19	bill019	88
20	bill068	95
  
 After load the data into Cabron the score column will sort as following:
1	76	79	79	83	84	84	87	88	89	89	90	90	90	93	95	96	96	97	100

the min/max index is 1/100.
So for the query as following will take all the block data. 
query1:select sum(score) from student when score score > 90 
query2:select sum(score) from student when score score > 60 and score < 70. 

Following two suggestion to reduce the block scan. 
Suggestion 1: according the score range to divide into multiple small range
for example 4: 
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n8891/index1.png> 
0: meas this block has not the score rang value
1: meas this block has the score rang value
If add this index, for the query1 only need scan 1/4 data of the block and
query2 no need scan any data, directly sckip this block

Suggestion 2: record more min/max for the score, for example every 5 rows
record one min/max
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n8891/index2.png> 
If add this index for query1 only need scan 1/2 data of the block and query2
only need scan 1/4 data of the block

this is the raw idea, please Jacky, Ravindra and liang correct it whether we
can add this feature. thanks 

Regards
Bill 




--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-add-more-index-for-sort-columns-tp8891.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.

Re: 【DISCUSS】add more index for sort columns

Posted by Ravindra Pesala <ra...@gmail.com>.
Hi Bill,
Min/max for measure columns are already added in V3 format. Now measure
columns filters are being added now so it does block and blocklet pruning
based on min/max to reduce IO and processing.

And as per your suggestions, column need to be sorted and maintain multiple
ranges in metadata. But if the data is sorted we can do binary search and
find out the data directly we may not require to maintain ranges in this
case.
If the data is not sorted then maintain more min/max may give some benefit.
This approach we can take as alternate approach​ for inverted indexes.

Regards,
Ravindra.

On Tue, Mar 14, 2017, 20:40 bill.zhou <zg...@163.com> wrote:

> hi all
>
>   Carbon will add min/max index for sort columns which used for better
> filter query. So can we add more index for the sort column to make filter
> faster.
>
>   This is one idea which I get from anther database design.
>    For example this is one student, and the column: score in the student
> table which will be sorted column. And the score range is from 1 to 100.
> The table as following:
>
> id      name    score
> 1       bill001 83
> 2       bill002 84
> 3       bill003 90
> 4       bill004 89
> 5       bill005 93
> 6       bill006 76
> 7       bill007 87
> 8       bill008 90
> 9       bill009 89
> 10      bill010 96
> 11      bill011 96
> 12      bill012 100
> 13      bill013 84
> 14      bill014 90
> 15      bill015 79
> 16      bill016 1
> 17      bill017 97
> 18      bill018 79
> 19      bill019 88
> 20      bill068 95
>
>  After load the data into Cabron the score column will sort as following:
> 1       76      79      79      83      84      84      87      88
> 89      89      90      90      90      93      95      96      96      97
>     100
>
> the min/max index is 1/100.
> So for the query as following will take all the block data.
> query1:select sum(score) from student when score score > 90
> query2:select sum(score) from student when score score > 60 and score < 70.
>
> Following two suggestion to reduce the block scan.
> Suggestion 1: according the score range to divide into multiple small range
> for example 4:
> <
> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n8891/index1.png
> >
> 0: meas this block has not the score rang value
> 1: meas this block has the score rang value
> If add this index, for the query1 only need scan 1/4 data of the block and
> query2 no need scan any data, directly sckip this block
>
> Suggestion 2: record more min/max for the score, for example every 5 rows
> record one min/max
> <
> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n8891/index2.png
> >
> If add this index for query1 only need scan 1/2 data of the block and
> query2
> only need scan 1/4 data of the block
>
> this is the raw idea, please Jacky, Ravindra and liang correct it whether
> we
> can add this feature. thanks
>
> Regards
> Bill
>
>
>
>
> --
> View this message in context:
> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-add-more-index-for-sort-columns-tp8891.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>