Posted to dev@kylin.apache.org by Luke Han <lu...@apache.org> on 2014/12/23 15:28:22 UTC

[Proposal] Kylin Streaming Cube Builder

Hi all,
    Please refer to the new proposal for a Kylin Streaming Cube Builder from
Branky Shao:

https://github.com/KylinOLAP/Kylin/wiki/%5BProposal%5D-Kylin-Streaming-Cube-Builder

    Any suggestions and ideas are welcome; please reply here.

    Thanks.

Luke

--------Text copy--------

Kylin Streaming Cube Builder

-- By Branky Shao <https://github.com/branky>, 2014-12-22

Proposal

Although Kylin provides sub-second OLAP analysis latency
<http://en.wikipedia.org/wiki/Real-time_business_intelligence>, its data latency
<http://en.wikipedia.org/wiki/Real-time_business_intelligence> is still
very long, because it uses Hive tables generated by batch ETL as the source
to build cubes. The ETL process usually takes hours to finish, and the cube
building process itself also needs a few hours. Users therefore cannot use
Kylin to analyze up-to-the-minute data.

Currently, Kylin's cube builder uses cube metadata (defined by the cube admin
or designer) to build a cube from one fact table and several dimension
tables in Hive. A cube is stored as one or more HTables in HBase; each
HTable is called a segment of the cube. The metadata is stored in HBase.
The dimension tables are also imported into HBase and stored as snapshots.
Kylin's query engine parses the SQL query from the client and fetches the
required data from HBase.

To reduce data latency as far as possible, we can build the cube from stream
data instead of static data (generated by batch ETL). The feasibility of
performing OLAP analysis on high-volume stream data was studied a decade
ago (Online Analytical Processing Stream Data: Is It Feasible?
<http://www.cse.wustl.edu/~ychen/public/J82.pdf>). The authors proposed a
method called stream_cube, which uses a *tilt time frame*, explores only
cuboids from the *minimal interesting layer* and the *observation layer*, and
adopts an algorithm called *popular path* to partially materialize the cube.
The study showed that the approach is cost-effective and realistic.
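
As a rough illustration of the tilt time frame idea (recent data kept at fine
granularity, older data folded into coarser slots), here is a minimal Python
sketch. It assumes a hypothetical layout of 4 quarter-hour slots, 24 hourly
slots and 31 daily slots, and a single additive measure. The class and method
names are invented for this example; they come neither from the paper nor
from Kylin.

```python
from collections import deque


class TiltTimeFrame:
    """Keeps recent totals at fine granularity and older totals at coarser ones."""

    # (level name, slots kept at that level); this layout is an assumption.
    LEVELS = [("quarter", 4), ("hour", 24), ("day", 31)]

    def __init__(self):
        self.slots = {name: deque() for name, _ in self.LEVELS}

    def add_quarter(self, value):
        """Register the aggregate of one finished 15-minute window."""
        self._push(0, value)

    def _push(self, level, value):
        name, capacity = self.LEVELS[level]
        slots = self.slots[name]
        slots.append(value)
        if level + 1 < len(self.LEVELS):
            if len(slots) == capacity:
                # The level is full: fold it into one slot of the next,
                # coarser level and start over at this level.
                total = sum(slots)
                slots.clear()
                self._push(level + 1, total)
        elif len(slots) > capacity:
            # Coarsest level: forget the oldest slot.
            slots.popleft()

    def total(self):
        """Sum of the measure over everything still retained."""
        return sum(sum(s) for s in self.slots.values())


frame = TiltTimeFrame()
for _ in range(5):
    frame.add_quarter(1)
print(dict((name, list(s)) for name, s in frame.slots.items()))
```

After five quarter-hour windows, the first four have been folded into a single
hourly slot, so the frame holds two numbers instead of five while the overall
total is unchanged.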

We can ingest up-to-date data from a real-time messaging system (e.g. Apache
Kafka <http://kafka.apache.org/>) and implement stream_cube as a topology
in Apache Storm <https://storm.apache.org/> to build Kylin cube segments
continuously. This could solve the data latency problem for Kylin. Below is
a high-level architecture diagram of this solution.

Re: [Proposal] Kylin Streaming Cube Builder

Posted by Li Yang <li...@apache.org>.
Streaming is the right approach to reduce cube build delay and provide
more real-time data from Kylin. We have received similar requests a
couple of times from real users. Any contribution in this area is
highly appreciated.

I like many ideas from the stream cube paper, such as the tilt time frame and
partial cubing. Kylin can definitely borrow those. In particular, we may
aggregate the base cuboid on the fly and, at each batch cut, populate the
remaining cuboids following Kylin's spanning tree to create a micro segment.
Such micro segments can be spilled to disk and later merged at a coarser
granularity, e.g. into hourly segments, and then loaded into HTables to
serve queries.
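
To make the micro segment idea concrete, here is a minimal Python sketch under
stated assumptions: two hypothetical dimensions (country, category), a single
SUM measure called amount, and in-memory dicts standing in for cuboids. For
brevity every cuboid is rolled up directly from the base cuboid rather than
from its parent in a spanning tree, so this is a simplification and not
Kylin's actual build order. All function names are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("country", "category")  # hypothetical dimension names


def ingest(base, event):
    """Aggregate one incoming event into the base cuboid on the fly."""
    key = tuple(event[d] for d in DIMENSIONS)
    base[key] += event["amount"]


def cut_micro_segment(base):
    """At batch cut, derive every cuboid by rolling up the base cuboid."""
    segment = {}
    for size in range(len(DIMENSIONS), -1, -1):
        for dims in combinations(DIMENSIONS, size):
            positions = [DIMENSIONS.index(d) for d in dims]
            cuboid = defaultdict(float)
            for key, measure in base.items():
                cuboid[tuple(key[i] for i in positions)] += measure
            segment[dims] = dict(cuboid)
    return segment


def merge_segments(a, b):
    """Merge two segments cuboid by cuboid; valid for additive measures only."""
    merged = {}
    for dims in a.keys() | b.keys():
        cuboid = defaultdict(float, a.get(dims, {}))
        for key, measure in b.get(dims, {}).items():
            cuboid[key] += measure
        merged[dims] = dict(cuboid)
    return merged


base = defaultdict(float)
for event in [
    {"country": "US", "category": "toys", "amount": 10.0},
    {"country": "US", "category": "books", "amount": 5.0},
    {"country": "CN", "category": "toys", "amount": 2.0},
]:
    ingest(base, event)

micro = cut_micro_segment(base)
hourly = merge_segments(micro, micro)  # stand-in for merging two batches
print(micro[("country",)], hourly[()])
```

Merging by simple addition only works for measures such as SUM or COUNT;
measures like distinct count would need a mergeable intermediate
representation instead.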

Cheers
Yang
