Posted to dev@kylin.apache.org by Abhishek Sinha <ab...@gmail.com> on 2015/03/17 06:38:00 UTC

Fact Table Distinct columns and Row Key

Hi,

Can anyone explain the two steps in the cube build process?

1. Why do we need to extract the distinct columns from Fact Table or
calculate the HIVE table cardinality?


2. What is the use of RowKey? How is it calculated? How does it help in
calculating HTable Region splits?


Is there any documentation available on these? Or any research paper/book
referred during the project?

Re: Fact Table Distinct columns and Row Key

Posted by Luke Han <lu...@gmail.com>.
Awesome Shaofeng, could you please help to add these to our FAQ page?

Thanks.


Best Regards!
---------------------

Luke Han


Re: Fact Table Distinct columns and Row Key

Posted by Abhishek Sinha <ab...@infoworks.io>.
Thanks. Good one :)




-- 
Abhishek Sinha
Mobile: +919035191078
infoworks.io

Re: Fact Table Distinct columns and Row Key

Posted by hongbin ma <ma...@apache.org>.
it is quite a neat explanation of RowKey:)


Re: Fact Table Distinct columns and Row Key

Posted by "Shi, Shaofeng" <sh...@ebay.com>.
Here is what I know about Kylin:

On 3/17/15, 1:38 PM, "Abhishek Sinha" <ab...@gmail.com> wrote:

>Hi,
>
>Can anyone explain the two steps in the cube build process?
>
>1. Why do we need to extract the distinct columns from Fact Table or
>calculate the HIVE table cardinality?

Kylin builds a dictionary for each column, so it first needs to fetch the
distinct values of each column. Dictionary encoding greatly reduces the
storage size.
The cardinality is used to optimize the order of the columns in the row key
and so to plan how the cube is built, which helps to 1) reduce the cube
build time and 2) narrow the cube scan range, which improves query
performance.
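
To make the dictionary idea concrete, here is a minimal Python sketch. It is
not Kylin's actual dictionary code; the function names and the sample
"country" column are made up for illustration. It only shows why the distinct
values are needed first: each distinct value gets a small integer ID, and the
cardinality decides how many bytes an ID takes.

```python
def build_dictionary(values):
    """Map each distinct value to an integer ID, assigned in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def id_width(cardinality):
    """Number of bytes needed to store IDs 0 .. cardinality-1."""
    return max(1, ((cardinality - 1).bit_length() + 7) // 8)


# Hypothetical column from a fact table (illustrative data only).
country = ["US", "CN", "US", "DE", "CN", "US"]

dictionary = build_dictionary(country)   # needs the distinct values
width = id_width(len(dictionary))        # needs the cardinality

# Each string is now stored as a fixed-width ID instead of the raw text.
encoded = [dictionary[v].to_bytes(width, "big") for v in country]
```

With three distinct values, every entry fits in one byte regardless of how
long the original strings are, which is where the storage saving comes from.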

>
>2. What is the use of RowKey? How is it calculated? How does it help in
>calculating HTable Region splits?

RowKey is the key in Kylin's storage (HBase). It is composed of the
dimensions' values (encoded in bytes). Assume your table has dimension
columns A, B, C with cardinalities n1, n2, n3. The base cuboid will have
up to n1*n2*n3 rows (one per combination that actually occurs in the data),
and each row's key is A+B+C (the concatenation of the encoded bytes). When
a user sends a query like "select ... from fact where A=XX and B=YY and
C=ZZ group by A, B, C", Kylin uses encode(XX) + encode(YY) + encode(ZZ) as
the key to query HBase and get the pre-aggregated result.
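
The lookup described above can be sketched in a few lines of Python. This is
a simplified model under stated assumptions, not Kylin's implementation: a
plain dict stands in for the HBase table, the measure is a simple SUM, every
dimension is encoded in one byte, and the fact rows and helper names are
hypothetical. As far as I know the real row key also carries extra
information such as a cuboid identifier, which is left out here.

```python
from collections import defaultdict


def build_dictionary(values):
    """Map each distinct value to an integer ID, assigned in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def row_key(dicts, row, width=1):
    """Concatenate the fixed-width encoded IDs of the dimension values."""
    return b"".join(d[v].to_bytes(width, "big") for d, v in zip(dicts, row))


# Hypothetical fact rows: (A, B, C, measure).
facts = [
    ("XX", "YY", "ZZ", 10),
    ("XX", "YY", "ZZ", 5),
    ("XX", "YY", "WW", 7),
    ("PP", "YY", "ZZ", 1),
]

# One dictionary per dimension column A, B, C.
dicts = [build_dictionary(col) for col in list(zip(*facts))[:3]]

# "Build" the base cuboid: pre-aggregate the measure per row key.
base_cuboid = defaultdict(int)
for a, b, c, measure in facts:
    base_cuboid[row_key(dicts, (a, b, c))] += measure

# Query: where A=XX and B=YY and C=ZZ  ->  a single key lookup.
lookup = row_key(dicts, ("XX", "YY", "ZZ"))
result = base_cuboid[lookup]
```

Here the cardinalities are 2, 1 and 2, so the base cuboid could hold up to 4
rows, but only the 3 combinations present in the data are stored.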
>
>
>Is there any documentation available on these? Or any research paper/book
>referred during the project?
Check the docs here, especially "Design Cube in Kylin.pdf":
https://github.com/KylinOLAP/Kylin/tree/master/docs

>