Posted to dev@kylin.apache.org by Long Zhou <lo...@gmail.com> on 2015/02/26 16:39:17 UTC

Choosing between Kylin and Lens

[delivery to user@kylin failed, resend to dev@kylin]

Hi Kylin and Lens communities,

    I am working on a big data analysis project and am considering using
Kylin or Lens. Do you have any guidelines/recommendations on how to choose
the right solution? We are particularly interested in the performance
characteristics of these two solutions on terabytes of sparse data.
    I just started learning the two projects. It seems Kylin is more like
MOLAP while Lens is more like ROLAP, is that correct? Do the differences
between MOLAP and ROLAP apply here?
    When using Hive as storage, it seems Kylin might perform better since
data is pre-aggregated and cached. How does Kylin handle sparse tables and
avoid empty cells in the cache? Does Lens have a cache on top of Hive?
    Lens supports columnar data warehouses like Redshift. How much
performance could we gain by loading data to Redshift?
    Where can I find performance benchmark data for the two projects?

Best regards,
Long Zhou

Re: Choosing between Kylin and Lens

Posted by Li Yang <li...@apache.org>.

An answer from the Kylin perspective. :-)

As with Lens, there is no performance benchmark for Kylin at the moment.

> Do you have some guidelines/recommendations on how to choose the right
solution?

Kylin's advantage is the pre-calculation of joins and aggregations. If your
queries are at a high aggregation level or involve many joins, Kylin will
have an edge. In addition, Kylin is part of the Hadoop family and has an
ANSI SQL interface, which differentiates it from some other solutions.

>     When using Hive as storage, it seems Kylin might perform better since
> data is pre-aggregated and cached.

Kylin stores the cube in HBase; the Hive table is the input. Data is read
from Hive, built into a cube with MapReduce, and stored in HBase. Users write
queries against the original Hive tables, and Kylin answers them from the
cube without accessing Hive at runtime.
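
As a rough illustration of the idea (a toy in-memory sketch, not Kylin's
actual build or query code; the function names and sample rows are made up
for this example), pre-aggregating every combination of dimensions lets a
GROUP BY be answered by a lookup instead of a scan of the source rows:

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every subset of dimensions (each cuboid)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in rows:
                # Keys follow the order in which dimensions were declared.
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[frozenset(dims)] = dict(cuboid)
    return cube


def query(cube, group_by):
    """Answer a GROUP BY from the precomputed cuboid; source rows are not read."""
    return cube[frozenset(group_by)]


# Hypothetical fact rows standing in for a Hive table.
rows = [
    {"country": "US", "year": 2014, "sales": 10},
    {"country": "US", "year": 2015, "sales": 5},
    {"country": "CN", "year": 2015, "sales": 7},
]
cube = build_cube(rows, ("country", "year"), "sales")
sales_by_country = query(cube, ["country"])
```

The build step is where the cost is paid; at query time only the cuboid
matching the GROUP BY columns is consulted.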

> How does Kylin handle sparse tables and avoid empty cells in cache?

Data is encoded using a dictionary and then stored in the cube. So every
value in the cube is a code point of minimal length, including empties.
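
A minimal sketch of dictionary encoding in general (an assumption about the
technique, not Kylin's real dictionary format; the helper names are
hypothetical): each distinct value, including the empty one, is mapped to a
small integer, and the code width depends only on the number of distinct
values.

```python
def build_dictionary(values):
    """Map each distinct value (None included) to a small integer code."""
    # dict.fromkeys keeps first-seen order and drops repeats.
    return {value: code for code, value in enumerate(dict.fromkeys(values))}


def code_width_bits(dictionary):
    """Smallest number of bits that can represent every code."""
    return max(1, (len(dictionary) - 1).bit_length())


def encode(values, dictionary):
    """Replace each value with its code."""
    return [dictionary[v] for v in values]


# A sparse column: None stands for an empty cell.
column = ["US", None, "CN", "US", None]
dictionary = build_dictionary(column)
encoded = encode(column, dictionary)
width = code_width_bits(dictionary)
```

An empty cell therefore costs no more than any other value: it is one more
entry in the dictionary and one short code per occurrence.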


Cheers
Yang


On Fri, Feb 27, 2015 at 3:15 PM, amareshwarisr . <am...@gmail.com>
wrote:

> Hello Long Zhou,
>
> Thanks for reaching out. I'm developer at Lens and trying to answer your
> questions with respect to Lens.
>
> On Thu, Feb 26, 2015 at 9:09 PM, Long Zhou <lo...@gmail.com> wrote:
>
> > [delivery to user@kylin failed, resend to dev@kylin]
> >
> > Hi Kylin and Lens communities,
> >
> >     I am working on a big data analysis project and consider using Kylin
> > or Lens. Do you have some guidelines/recommendations on how to choose the
> > right solution? We are particularly interested in the performance
> > characteristics of these two solutions on terabytes of sparse data.
> >
>
> We don't have guidelines/recommendations/performance characteristics
> documented anywhere as of now. But user documentation should help you with
> some details of the system. Lens itself does not have any overhead with
> respect to query execution, it would be given to underlying engine and the
> performance numbers published in underlying systems should be sufficient.
>
>
> >     I just started learning the two projects. It seems Kylin is more like
> > MOLAP while Lens is more like ROLAP, is that correct? Does the
> differences
> > between MOLAP and ROLAP apply here?
> >
>
> I  agree with Lens that it is ROLAP like system. We can say Lens can become
> HOLAP (http://en.wikipedia.org/wiki/ROLAP,
> http://en.wikipedia.org/wiki/HOLAP,
> http://www.1keydata.com/datawarehousing/molap-rolap.html). And as said in
> ROLAP, performance of Lens depends on underlying execution engines and if
> the data is not aggregated, it would pick detailed tables for answering.
> But if aggregated data is available through an ETL process, it would make
> use of it.
>
>     When using Hive as storage, it seems Kylin might perform better since
> > data is pre-aggregated and cached. How does Kylin handle sparse tables
> and
> > avoid empty cells in cache? Does Lens have cache on top of Hive?
> >
>
> No, Lens does not have any cache on top of Hive.
>
>
> >     Lens supports columnar data warehouses like Redshift. How much
> > performance could we gain by loading data to Redshift? Where can I find
> > performance benchmark data for the two projects?
> >
>
> It would be same as how fast Redshift can answer queries. Lens comes with
> JDBCDriver for reaching systems which can understand jdbc. At inmobi, we
> are using it with Columnar dataware house - InfoBright (
> https://www.infobright.com/) in production, it should work with Redshift
> as
> well, but it is not yet tested with RedShift.
>
> Thanks
> Amareshwari
>

Re: Choosing between Kylin and Lens

Posted by "amareshwarisr ." <am...@gmail.com>.

Hello Long Zhou,

Thanks for reaching out. I'm a developer at Lens and will try to answer your
questions with respect to Lens.

On Thu, Feb 26, 2015 at 9:09 PM, Long Zhou <lo...@gmail.com> wrote:

> [delivery to user@kylin failed, resend to dev@kylin]
>
> Hi Kylin and Lens communities,
>
>     I am working on a big data analysis project and consider using Kylin
> or Lens. Do you have some guidelines/recommendations on how to choose the
> right solution? We are particularly interested in the performance
> characteristics of these two solutions on terabytes of sparse data.
>

We don't have guidelines, recommendations, or performance characteristics
documented anywhere as of now, but the user documentation should help you
with some details of the system. Lens itself does not add any overhead to
query execution: the query is handed to the underlying engine, so the
performance numbers published for the underlying systems should be sufficient.

>     I just started learning the two projects. It seems Kylin is more like
> MOLAP while Lens is more like ROLAP, is that correct? Does the differences
> between MOLAP and ROLAP apply here?
>

I agree that Lens is a ROLAP-like system. We can say Lens can become
HOLAP (http://en.wikipedia.org/wiki/ROLAP,
http://en.wikipedia.org/wiki/HOLAP,
http://www.1keydata.com/datawarehousing/molap-rolap.html). As described for
ROLAP, the performance of Lens depends on the underlying execution engines.
If the data is not aggregated, Lens picks the detailed tables to answer the
query, but if aggregated data is available through an ETL process, it makes
use of that.
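
To make the selection idea concrete, here is a hypothetical sketch (not the
Lens API or its actual selection logic; the table names, columns, and row
counts are invented): among the tables that contain every column the query
needs, prefer the smallest one, which is normally the aggregated table when
one exists.

```python
def pick_table(tables, needed_columns):
    """Return the name of the smallest table that has all needed columns."""
    needed = set(needed_columns)
    candidates = [t for t in tables if needed <= set(t["columns"])]
    if not candidates:
        raise ValueError("no table can answer columns: %s" % sorted(needed))
    return min(candidates, key=lambda t: t["rows"])["name"]


# Hypothetical catalog: one detailed table and one ETL-built aggregate.
tables = [
    {"name": "raw_sales",
     "columns": ["country", "city", "day", "sales"],
     "rows": 1_000_000_000},
    {"name": "sales_by_country_day",
     "columns": ["country", "day", "sales"],
     "rows": 50_000},
]

aggregated_choice = pick_table(tables, ["country", "sales"])
detailed_choice = pick_table(tables, ["city", "sales"])
```

A query on country can use the aggregate, while a query on city falls back
to the detailed table because the aggregate no longer carries that column.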

>     When using Hive as storage, it seems Kylin might perform better since
> data is pre-aggregated and cached. How does Kylin handle sparse tables and
> avoid empty cells in cache? Does Lens have cache on top of Hive?
>

No, Lens does not have any cache on top of Hive.

>     Lens supports columnar data warehouses like Redshift. How much
> performance could we gain by loading data to Redshift? Where can I find
> performance benchmark data for the two projects?
>

It would be the same as how fast Redshift can answer queries. Lens comes with
JDBCDriver for reaching systems that understand JDBC. At InMobi, we use it
in production with a columnar data warehouse, InfoBright
(https://www.infobright.com/). It should work with Redshift as well, but it
has not yet been tested with Redshift.

Thanks
Amareshwari
