Posted to dev@kylin.apache.org by Junhai Guo <ju...@gmail.com> on 2017/04/13 14:07:38 UTC

Dual timestamp columns support in 2.0, Fact table Partitioned on ingestion date, aggregate on event date

My Hive fact table is partitioned on an ingestion date column, but I need to
build the cube and query it on the actual event date column. Events can
arrive days or even weeks late. I want to build the cube incrementally each
day by specifying the ingestion date range. Does 2.0 support this scenario,
building the cube efficiently without scanning the whole fact table and
merging the newly ingested data with the existing calculations?

Thanks

Jerry
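
To make the request concrete: assuming the cube were partitioned on the
ingestion date (the workaround suggested in the replies below), a daily
incremental build could be scripted against Kylin's documented cube rebuild
REST API, which takes the segment range as UTC epoch milliseconds. A minimal
sketch; the host, credentials, and cube name are placeholders:

    # Minimal sketch: trigger a daily incremental build for yesterday's
    # ingestion-date window via Kylin's rebuild REST API. Placeholder host,
    # credentials, and cube name; assumes the cube's partition column is
    # the ingestion date.
    from datetime import datetime, timedelta, timezone
    import requests

    def day_bounds_ms(day):
        """UTC start/end of a calendar day, in epoch milliseconds."""
        start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
        end = start + timedelta(days=1)
        return int(start.timestamp() * 1000), int(end.timestamp() * 1000)

    yesterday = datetime.now(timezone.utc).date() - timedelta(days=1)
    start_ms, end_ms = day_bounds_ms(yesterday)

    resp = requests.put(
        "http://kylin-host:7070/kylin/api/cubes/my_cube/rebuild",
        json={"startTime": start_ms, "endTime": end_ms, "buildType": "BUILD"},
        auth=("ADMIN", "KYLIN"),  # Kylin's default sandbox credentials
    )
    resp.raise_for_status()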

Re: Dual timestamp columns support in 2.0, Fact table Partitioned on ingestion date, aggregate on event date

Posted by roger shi <ro...@hotmail.com>.
As Hive can have multiple partition columns, making the actual event date column a partition column in Hive may help in this case.
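
For illustration, a minimal sketch of such a layout, issued through the
PyHive client; the table and column names are made up, and ORC plus
string-typed date partitions are just one common setup:

    # Sketch: a fact table partitioned on BOTH dates, so a filter on either
    # column prunes at the HDFS directory level. Names are illustrative;
    # assumes the PyHive client (pip install pyhive).
    from pyhive import hive

    conn = hive.connect(host="hive-host", port=10000, username="etl")
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS fact_events (
            event_id BIGINT,
            amount   DOUBLE
        )
        PARTITIONED BY (ingestion_date STRING, event_date STRING)
        STORED AS ORC
    """)

    # Both predicates below are served by partition pruning, not a full scan.
    cursor.execute(
        "SELECT COUNT(*) FROM fact_events "
        "WHERE ingestion_date = '2017-04-17' AND event_date >= '2017-04-10'"
    )
    print(cursor.fetchone())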

________________________________
From: ShaoFeng Shi <sh...@apache.org>
Sent: April 17, 2017 11:54:48
To: dev
Subject: Re: Dual timestamp columns support in 2.0, Fact table Partitioned on ingestion date, aggregate on event date

Billy, Junhai's question is about how to leverage the Hive partition column
to avoid a full table scan when the Cube's partition date isn't the same as
Hive's.

This is a good point, I think. In many cases the Hive partition column isn't
the Cube's partition column (one is physical, the other is logical). If
Kylin could leverage both, that would be great.

In 2.0 there is no change in this part: you can't specify an additional time
range. So if you want to avoid repeatedly scanning the full Hive table,
please use its partition column as the Cube's, and add "actual event date"
as a normal dimension. That will fulfill both needs, although when running a
query, Kylin needs to scan additional segments, which may lower performance
a bit.

"Specifying the ingestion date range" sounds like a good idea; could you
please open a JIRA to track this? We can then discuss the details there.


2017-04-17 11:30 GMT+08:00 Billy Liu <bi...@apache.org>:

> Hi Junhai
>
> If you want to build the late-arrived data, you have to refresh the cube
> manually or call the refresh API. Kylin does not monitor the ingestion
> timestamp.
>
> 2017-04-13 22:07 GMT+08:00 Junhai Guo <ju...@gmail.com>:
>
> > My Hive fact table is partitioned on an ingestion date column, but I need
> > to build the cube and query it on the actual event date column. Events
> > can arrive days or even weeks late. I want to build the cube
> > incrementally each day by specifying the ingestion date range. Does 2.0
> > support this scenario, building the cube efficiently without scanning the
> > whole fact table and merging the newly ingested data with the existing
> > calculations?
> >
> > Thanks
> >
> > Jerry
> >
>



--
Best regards,

Shaofeng Shi 史少锋

Re: Dual timestamp columns support in 2.0, Fact table Partitioned on ingestion date, aggregate on event date

Posted by Billy Liu <bi...@apache.org>.
Hi Shaofeng,

Currently, each partitioned range build generates one segment.

Suppose Kylin could support two kinds of partitions: the first is "partition
on Hive ingestion", the second is "partition on cube segment". The
"ingestion partition" would help scan only the part of the data that has
never been built before. Kylin would then process all the late-arrival data
and merge it into the existing segments. This is a refresh operation, I
think. Are you proposing the build-and-refresh approach to address this
requirement?
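
For illustration, a build-and-refresh flow can already be approximated from
outside Kylin, assuming the cube is partitioned on the event date and the
Hive table carries both partition columns as suggested above. A minimal
sketch with placeholder names; it also assumes daily segments whose
boundaries match these day windows:

    # Sketch of a build-and-refresh driver: find which event dates the
    # latest ingestion touched, then refresh the existing segment for each.
    # All hosts, credentials, and table/cube names are placeholders.
    from datetime import datetime, timedelta, timezone
    from pyhive import hive
    import requests

    REBUILD_URL = "http://kylin-host:7070/kylin/api/cubes/my_cube/rebuild"
    AUTH = ("ADMIN", "KYLIN")

    def rebuild(start_ms, end_ms, build_type):
        # Kylin's documented rebuild API; buildType is BUILD, MERGE, or REFRESH.
        r = requests.put(REBUILD_URL, auth=AUTH,
                         json={"startTime": start_ms, "endTime": end_ms,
                               "buildType": build_type})
        r.raise_for_status()

    # 1. A cheap Hive query, pruned by the ingestion partition, lists the
    #    event dates that received rows yesterday (including late arrivals).
    cursor = hive.connect(host="hive-host", username="etl").cursor()
    cursor.execute("SELECT DISTINCT event_date FROM fact_events "
                   "WHERE ingestion_date = '2017-04-17'")
    touched = [row[0] for row in cursor.fetchall()]

    # 2. Refresh the daily segment covering each touched event date.
    #    (A brand-new event date with no segment yet needs BUILD instead.)
    for day in touched:
        start = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        end = start + timedelta(days=1)
        rebuild(int(start.timestamp() * 1000), int(end.timestamp() * 1000),
                "REFRESH")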

2017-04-17 11:54 GMT+08:00 ShaoFeng Shi <sh...@apache.org>:

> Billy, Junhai's question is about how to leverage the Hive partition column
> to avoid a full table scan when the Cube's partition date isn't the same as
> Hive's.
>
> This is a good point, I think. In many cases the Hive partition column
> isn't the Cube's partition column (one is physical, the other is logical).
> If Kylin could leverage both, that would be great.
>
> In 2.0 there is no change in this part: you can't specify an additional
> time range. So if you want to avoid repeatedly scanning the full Hive
> table, please use its partition column as the Cube's, and add "actual event
> date" as a normal dimension. That will fulfill both needs, although when
> running a query, Kylin needs to scan additional segments, which may lower
> performance a bit.
>
> "Specifying the ingestion date range" sounds like a good idea; could you
> please open a JIRA to track this? We can then discuss the details there.
>
>
> 2017-04-17 11:30 GMT+08:00 Billy Liu <bi...@apache.org>:
>
> > Hi Junhai
> >
> > If you want to build the late-arrived data, you have to refresh the cube
> > manually or call the refresh API. Kylin does not monitor the ingestion
> > timestamp.
> >
> > 2017-04-13 22:07 GMT+08:00 Junhai Guo <ju...@gmail.com>:
> >
> > > My Hive fact table is partitioned on an ingestion date column, but I
> > > need to build the cube and query it on the actual event date column.
> > > Events can arrive days or even weeks late. I want to build the cube
> > > incrementally each day by specifying the ingestion date range. Does 2.0
> > > support this scenario, building the cube efficiently without scanning
> > > the whole fact table and merging the newly ingested data with the
> > > existing calculations?
> > >
> > > Thanks
> > >
> > > Jerry
> > >
> >
>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>

Re: Dual timestamp columns support in 2.0, Fact table Partitioned on ingestion date, aggregate on event date

Posted by ShaoFeng Shi <sh...@apache.org>.
Billy, Junhai's question is about how to leverage the Hive partition column
to avoid a full table scan when the Cube's partition date isn't the same as
Hive's.

This is a good point, I think. In many cases the Hive partition column isn't
the Cube's partition column (one is physical, the other is logical). If
Kylin could leverage both, that would be great.

In 2.0 there is no change in this part: you can't specify an additional time
range. So if you want to avoid repeatedly scanning the full Hive table,
please use its partition column as the Cube's, and add "actual event date"
as a normal dimension. That will fulfill both needs, although when running a
query, Kylin needs to scan additional segments, which may lower performance
a bit.

"Specifying the ingestion date range" sounds like a good idea; could you
please open a JIRA to track this? We can then discuss the details there.
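
For illustration, the workaround maps onto the model definition roughly as
follows. This is a hypothetical fragment only; the field names follow
Kylin's model descriptor, and the table and column names are made up:

    # Hypothetical fragment of a Kylin model definition illustrating the
    # workaround: the Hive partition column doubles as the cube's partition
    # column, and the true event date is kept as a normal dimension.
    model_fragment = {
        "partition_desc": {
            "partition_date_column": "DEFAULT.FACT_EVENTS.INGESTION_DATE",
            "partition_date_format": "yyyy-MM-dd",
        },
        "dimensions": [
            # The event date is just another dimension, so queries can still
            # group and filter by it...
            {"table": "FACT_EVENTS", "columns": ["EVENT_DATE"]},
        ],
    }
    # ...but because a late event's EVENT_DATE may fall in any earlier
    # segment, a query filtered on EVENT_DATE may touch several segments.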


2017-04-17 11:30 GMT+08:00 Billy Liu <bi...@apache.org>:

> Hi Junhai
>
> If you want to build the late-arrived data, you have to refresh the cube
> manually or call the refresh API. Kylin does not monitor the ingestion
> timestamp.
>
> 2017-04-13 22:07 GMT+08:00 Junhai Guo <ju...@gmail.com>:
>
> > My Hive fact table is partitioned on an ingestion date column, but I need
> > to build the cube and query it on the actual event date column. Events
> > can arrive days or even weeks late. I want to build the cube
> > incrementally each day by specifying the ingestion date range. Does 2.0
> > support this scenario, building the cube efficiently without scanning the
> > whole fact table and merging the newly ingested data with the existing
> > calculations?
> >
> > Thanks
> >
> > Jerry
> >
>



-- 
Best regards,

Shaofeng Shi 史少锋

Re: Dual timestamp columns support in 2.0, Fact table Partitioned on ingestion date, aggregate on event date

Posted by Billy Liu <bi...@apache.org>.
Hi Junhai

If you want to build the late-arrived data, you have to refresh the cube
manually or call the refresh API. Kylin does not monitor the ingestion
timestamp.
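
For reference, the refresh API is the same documented rebuild endpoint with
buildType set to REFRESH. A minimal sketch; host, credentials, cube name,
and the segment range are placeholders, and the range must match an existing
segment's boundaries:

    # Minimal sketch: refresh one existing daily segment through Kylin's
    # rebuild REST API so late-arrived rows are folded into it.
    import requests

    resp = requests.put(
        "http://kylin-host:7070/kylin/api/cubes/my_cube/rebuild",
        json={
            "startTime": 1491868800000,  # 2017-04-11 00:00:00 UTC, epoch ms
            "endTime":   1491955200000,  # 2017-04-12 00:00:00 UTC
            "buildType": "REFRESH",      # vs. BUILD for a brand-new range
        },
        auth=("ADMIN", "KYLIN"),
    )
    resp.raise_for_status()
    print(resp.json())  # the job instance that tracks the refresh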

2017-04-13 22:07 GMT+08:00 Junhai Guo <ju...@gmail.com>:

> My Hive fact table is partitioned on an ingestion date column, but I need
> to build the cube and query it on the actual event date column. Events can
> arrive days or even weeks late. I want to build the cube incrementally each
> day by specifying the ingestion date range. Does 2.0 support this scenario,
> building the cube efficiently without scanning the whole fact table and
> merging the newly ingested data with the existing calculations?
>
> Thanks
>
> Jerry
>