Posted to dev@kylin.apache.org by Santoshakhilesh <sa...@huawei.com> on 2015/03/02 13:01:41 UTC

Suggestion required for Modelling with Kylin

Dear All ,

I work for a telecom major in NMS/OSS domain.

We have a product that is responsible for performance monitoring of the network.

I have been assigned to evaluate Kylin for multi dimension queries.

My Use case is as below.

I collect performance indicator data from network elements every 15 minutes; network elements can be cards, interfaces, devices, etc.

I have an aggregation engine which does object aggregation every 15 minutes and time aggregation every hour, day, week, month, and year.

This dynamic data needs to be shown as reports to users, who might be interested in different views.

a) Total traffic by area

b) Total bandwidth usage in particular area

c) Total traffic by service type like 3G, 2G, etc.

d) Total bandwidth usage by user etc...
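
The views listed above amount to summing one measure grouped by one dimension. A minimal sketch in plain Python, assuming a hypothetical record layout (the field names and sample values are illustrative and do not come from the post):

```python
from collections import defaultdict

# Hypothetical layout of one aggregated record (not from the original post).
FIELDS = ("area", "service_type", "user", "traffic_mb")

# Illustrative sample rows only; real data would come from the aggregation engine.
samples = [
    ("north", "3G", "u1", 120.0),
    ("north", "2G", "u2", 30.0),
    ("south", "3G", "u3", 75.5),
    ("south", "3G", "u1", 24.5),
]

def total_by(rows, dimension):
    """Sum traffic_mb grouped by a single dimension column."""
    idx = FIELDS.index(dimension)
    measure = FIELDS.index("traffic_mb")
    totals = defaultdict(float)
    for row in rows:
        totals[row[idx]] += row[measure]
    return dict(totals)

traffic_by_area = total_by(samples, "area")             # view (a)
traffic_by_service = total_by(samples, "service_type")  # view (c)
```

A cube pre-computes such group-by totals for combinations of dimensions so that a report does not have to scan the raw rows at query time.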

So basically the raw data is collected every 15 minutes and time-aggregated hourly, daily, weekly, and yearly, and I need to show reports by different dimensions like area, service type, user, etc. Our dimensions will be small, but the raw data is large: every 15 minutes I will collect from around 10 M resources, and each resource will have 10-15 indicators.
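
As a rough sense of scale, the figures above can be turned into a daily volume estimate. This is only back-of-the-envelope arithmetic, assuming one row per resource per 15-minute interval (an assumption, since the post does not describe the row layout):

```python
RESOURCES = 10_000_000              # "around 10 M resources" from the post
INDICATORS_MIN, INDICATORS_MAX = 10, 15
INTERVALS_PER_DAY = 24 * 60 // 15   # number of 15-minute intervals in a day

# Assumption: one row per resource per interval, indicators stored as columns.
rows_per_day = RESOURCES * INTERVALS_PER_DAY
values_per_day_min = rows_per_day * INDICATORS_MIN
values_per_day_max = rows_per_day * INDICATORS_MAX

print(f"rows/day: {rows_per_day:,}")
print(f"indicator values/day: {values_per_day_min:,} to {values_per_day_max:,}")
```

Under that assumption the raw feed is on the order of a billion rows per day before any time aggregation.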

I have the following queries:

1) If I model my running data as a fact table, then this table will get refreshed quite often, like every 15 minutes. What will be the impact of cube building at such short refresh intervals?

2) I understand Kylin has a roadmap for supporting streams. What kinds of source inputs for streaming has Kylin considered? Is it only Storm, or also others such as Spark Streaming? Can it also support a stream of JSON-encoded data over HTTP?

3) What query latency can I expect if I have to query specific records over a day, month, or year?

Any suggestion regarding this is welcome.

Regards,
Santosh Akhilesh
Bangalore R&D
HUAWEI TECHNOLOGIES CO.,LTD.

www.huawei.com
-------------------------------------------------------------------------------------------------------------------------------------
This e-mail and its attachments contain confidential information from HUAWEI, which
is intended only for the person or entity whose address is listed above. Any use of the
information contained herein in any way (including, but not limited to, total or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by
phone or email immediately and delete it!

Re: Suggestion required for Modelling with Kylin

Posted by Li Yang <li...@apache.org>.

I'll try to be more specific on top of Luke's answers. :-)

      2) I understand Kylin has a roadmap for supporting streams. What kinds
of source inputs for streaming has Kylin considered? Is it only Storm, or
also others such as Spark Streaming? Can it also support a stream of
JSON-encoded data over HTTP?
      L: The streaming source should not be limited by Kylin; we are
thinking about supporting a more generic interface.
      Y: There will be an extension point to accommodate different transport
protocols and record parsers. For the first implementation, we take Kafka as
the input stream (you may add Storm before Kafka if that makes sense) and
Avro as the record schema (a common choice among Hadoop users). Support for
JSON over HTTP will very likely require in-house effort on your side.
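
To illustrate what a custom "record parser" for JSON-encoded data might involve, here is a minimal sketch in plain Python. It is not Kylin's extension API (the thread does not show one); the function name, column list, and field names are all hypothetical:

```python
import json
from typing import Any, List

# Hypothetical column order expected by the downstream table (an assumption).
COLUMNS = ["ts", "area", "service_type", "traffic_mb"]

def parse_json_record(payload: bytes) -> List[Any]:
    """Decode one JSON-encoded message into a flat row in COLUMNS order.

    Missing fields become None rather than raising, so a single incomplete
    record does not stop the stream.
    """
    record = json.loads(payload.decode("utf-8"))
    return [record.get(col) for col in COLUMNS]

row = parse_json_record(
    b'{"ts": "2015-03-02T13:00:00", "area": "north", '
    b'"service_type": "3G", "traffic_mb": 120.0}'
)
```

The transport (Kafka, HTTP, and so on) would be handled separately; the parser only turns one message payload into one row.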

      3) What query latency can I expect if I have to query specific
records over a day, month, or year?
      L: The more filter conditions, the more efficient the query against
the Kylin cube.
      Y: <5 seconds, assuming the query returns a few thousand rows. Kylin
is not designed for ETL purposes; returning millions of rows is not our goal.

Cheers
Yang

On Mon, Mar 2, 2015 at 8:44 PM, Luke Han <lu...@gmail.com> wrote:

> Hi Santosh,
>     Your requirement is actually a very typical case of near real-time +
> historical analytics. Our solution is to split the data across different
> storage: streaming support to handle near real-time data (we are working
> on it, coming with the v0.7.x release), and the current batch job to
> pre-calculate daily/weekly/monthly... data into the cube for historical
> data (already there).
>     Could you please share more details of your case with us? It would be
> perfect to pilot our new design with your data and queries.
>
>     Please refer to the answers below for your questions:
>
>       1) If I model my running data as a fact table, then this table will
> get refreshed quite often, like every 15 minutes. What will be the impact
> of cube building at such short refresh intervals?
>        L: The cube build process reads the data just once for each
> incremental build; it depends on Hive. And I think leveraging streaming
> makes more sense for such frequently updated data.
>
>       2) I understand Kylin has a roadmap for supporting streams. What kinds
> of source inputs for streaming has Kylin considered? Is it only Storm, or
> also others such as Spark Streaming? Can it also support a stream of
> JSON-encoded data over HTTP?
>       L: The streaming source should not be limited by Kylin; we are
> thinking about supporting a more generic interface.
>
>       3) What query latency can I expect if I have to query specific
> records over a day, month, or year?
>      L: The more filter conditions, the more efficient the query against
> the Kylin cube.
>
> Looking forward to hearing your story with Kylin.
> Thank you very much.
> Luke
>
>
>
> Best Regards!
> ---------------------
>
> Luke Han
>

Re: Suggestion required for Modelling with Kylin

Posted by Luke Han <lu...@gmail.com>.

Hi Santosh,
    Your requirement is actually a very typical case of near real-time +
historical analytics. Our solution is to split the data across different
storage: streaming support to handle near real-time data (we are working
on it, coming with the v0.7.x release), and the current batch job to
pre-calculate daily/weekly/monthly... data into the cube for historical
data (already there).
    Could you please share more details of your case with us? It would be
perfect to pilot our new design with your data and queries.

    Please refer to the answers below for your questions:

      1) If I model my running data as a fact table, then this table will get
refreshed quite often, like every 15 minutes. What will be the impact of cube
building at such short refresh intervals?
       L: The cube build process reads the data just once for each incremental
build; it depends on Hive. And I think leveraging streaming makes more sense
for such frequently updated data.
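
To picture what "incremental build" means at a 15-minute refresh rate, the sketch below splits a time range into consecutive, non-overlapping windows, so that each build only has to read the rows that arrived in its own window. This is plain Python for illustration, not Kylin's build or scheduling API; the function name and the 15-minute default are assumptions taken from the question:

```python
from datetime import datetime, timedelta

def incremental_windows(start, end, step=timedelta(minutes=15)):
    """Yield consecutive half-open [lo, hi) windows covering start..end."""
    lo = start
    while lo < end:
        hi = min(lo + step, end)  # the last window may be shorter than step
        yield lo, hi
        lo = hi

# One hour of data split into 15-minute build windows.
windows = list(
    incremental_windows(datetime(2015, 3, 2, 0, 0), datetime(2015, 3, 2, 1, 0))
)
```

Because the windows are half-open and contiguous, no row is read by two builds and none is skipped.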

      2) I understand Kylin has a roadmap for supporting streams. What kinds
of source inputs for streaming has Kylin considered? Is it only Storm, or
also others such as Spark Streaming? Can it also support a stream of
JSON-encoded data over HTTP?
      L: The streaming source should not be limited by Kylin; we are
thinking about supporting a more generic interface.

      3) What query latency can I expect if I have to query specific
records over a day, month, or year?
     L: The more filter conditions, the more efficient the query against
the Kylin cube.

Looking forward to hearing your story with Kylin.
Thank you very much.
Luke



Best Regards!
---------------------

Luke Han
