Posted to dev@kylin.apache.org by Vikram Kone <vi...@gmail.com> on 2015/02/27 17:51:22 UTC

Need advice for kylin newbie

Hi,
I'm a newbie when it comes to Kylin and the Hadoop ecosystem in general. Our
team has been predominantly a Microsoft shop that uses the MS stack for most
of its BI needs. So we are talking SQL Server for storing relational data
and SQL Server Analysis Services (SSAS) for building MOLAP cubes for
sub-second query analysis.
Lately, our cube query response times have been degrading as our data sizes
grew considerably over the past year. We are talking fact tables in the
10-100 billion row range and a few dimensions in the tens to hundreds of
millions of rows. We tried vertically scaling up our SSAS server, but
queries are still taking a few minutes. In light of this, I was entrusted
with the task of figuring out an open source solution that would scale to
our current and future needs for data analysis.
I looked at a bunch of open source tools like Apache Drill, Druid, AtScale,
Spark, Storm, Kylin etc. and settled on exploring Kylin as the first step,
given its recent rise in popularity and the growing ecosystem around it.
I started to build out a POC for our MOLAP cubes using Kylin with HDFS/Hive
as the data source, to see how it scales for our queries/measures in real
time with real data. The setup has been a nightmare so far. Configuration
of the cluster takes too long. I tried the Docker version and it fails with
cryptic errors. Then I tried installing it using the build from root option
on a Hadoop cluster and am seeing more issues, related to cube building.
Same with the binary package installation. It's just taking too long to set
up. There should be an easier way to do this :(
Roughly, these are the requirements for our team:
1. Should be able to create facts, dimensions and measures from our data
sets easily.
2. Cubes should be queryable from Excel and Tableau.
3. Easily scale out by adding new nodes when data grows.
4. Very little maintenance, and highly stable for production-level workloads.
5. Sub-second query latencies for COUNT DISTINCT measures (since the majority
of our expensive measures are of this type). We are OK with approximate
distinct counts for better performance.
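
To make item 5 concrete, here is a minimal sketch of what an approximate
distinct count can look like. It assumes a simple K-minimum-values estimator
and is only an illustration: the function names are made up for this example,
and it is not how SSAS or Kylin compute distinct counts. The point is that
small per-partition sketches can be merged later without rescanning the fact
rows.

```python
import hashlib
import heapq


def hash01(value: str) -> float:
    """Map a value to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(values, k=1024):
    """Keep the k smallest distinct hash values seen.

    For simplicity this hashes everything first; a real implementation
    would stream and keep only k values in memory.
    """
    return set(heapq.nsmallest(k, {hash01(v) for v in values}))


def merge(a, b, k=1024):
    """Combine two sketches into the sketch of the combined data."""
    return set(heapq.nsmallest(k, a | b))


def estimate(sketch, k=1024):
    """Estimate the number of distinct values behind a sketch."""
    if len(sketch) < k:
        return len(sketch)  # fewer than k distinct values: count is exact
    # The k-th smallest of n uniform hashes sits near k / n.
    return round((k - 1) / max(sketch))


# Two partitions that share 5,000 users; the true distinct count is 15,000.
east = kmv_sketch(f"user-{i}" for i in range(0, 10_000))
west = kmv_sketch(f"user-{i}" for i in range(5_000, 15_000))
print(estimate(merge(east, west)))  # an estimate near 15000, not exact
```

With k = 1024 the relative error is typically a few percent, and a larger k
trades memory for accuracy.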

So given these requirements, is Kylin the right solution to replace our
on-premise MOLAP cubes? As long as our users can pivot/slice & dice the
measures quickly from client tools like Excel and Tableau by dragging and
dropping dimensions into rows/columns without needing to join to the fact
table, we are OK with however the data is laid out. It doesn't have to be a
cube. It can be a flat file in HDFS for all we care. I would love to chat
with someone who has successfully done this kind of migration from SSAS OLAP
cubes to Kylin in their team or company, and learn about the pros and cons
before I spend more time configuring this stuff.

This is it for now. Looking forward to a great discussion.

P.S. We have decided on using Azure as our managed Hadoop system in the
cloud.

Re: Need advice for kylin newbie

Posted by Luke Han <lu...@gmail.com>.
Hi Vikram,
    I would like to confirm this one: "Configuration of the cluster takes too
long."
    Do you mean setting up both the Hadoop cluster and the Kylin server, or
just the Kylin server?

    BTW, for Kylin installation, as Yang mentioned, please refer to the new
version (v0.7.x) binary package here:
http://kylin.incubator.apache.org/download/ It was just updated with a fix
for a bug that previously blocked saving cubes. This package should be easy
to set up on a Hadoop cluster, even in the cloud.
    Please feel free to let us know if there's any issue.

    Thanks.

Luke


Best Regards!
---------------------

Luke Han

2015-03-02 18:40 GMT+08:00 Li Yang <li...@apache.org>:

> Hi Vikram,
>
> Many thanks for your valuable feedback!
>
> From your requirements, Kylin is actually a very good fit. However, we
> know the installation is really painful, and we are working hard to
> improve it. The new 0.7.x will have a binary release that should unpack
> and run on Hortonworks and Cloudera distributions with zero configuration.
> As for Azure, well, we haven't tried it yet, but that's where you can help
> us.
>
> On the other hand, we will be glad to help with your POC. If possible,
> share your (sample) data model and query patterns, and we can suggest the
> best cube design for Kylin.
>
>
> Cheers
> Yang
>

Re: Need advice for kylin newbie

Posted by Li Yang <li...@apache.org>.
Hi Vikram,

Many thanks for your valuable feedback!

From your requirements, Kylin is actually a very good fit. However, we
know the installation is really painful, and we are working hard to
improve it. The new 0.7.x will have a binary release that should unpack
and run on Hortonworks and Cloudera distributions with zero configuration.
As for Azure, well, we haven't tried it yet, but that's where you can help
us.

On the other hand, we will be glad to help with your POC. If possible,
share your (sample) data model and query patterns, and we can suggest the
best cube design for Kylin.


Cheers
Yang


On Sat, Feb 28, 2015 at 2:12 AM, Adunuthula, Seshu <sa...@ebay.com>
wrote:

> Vikram,
>
> Thank you for the honest and direct feedback on Kylin. As you rightly
> called out, the sweet spot for Kylin is the ability to do MOLAP on 10-100
> billion rows with sub-second query responses. So we believe Kylin is the
> right tool for your requirements below.
>
> > So given these requirements, is Kylin the right solution to replace our
> > on-premise MOLAP cubes?  As long as our users can pivot/slice & dice the
> > measures quickly from client tools like Excel and Tableau by dragging and
> > dropping dimensions into rows/columns w/o the need to join to fact table,
>
> Docker is useful for single-machine developer deployments, and I have
> found that a certain level of Docker expertise is needed before you can
> successfully deploy them.
>
> You are doing a certain set of firsts that could be making your setup a
> nightmare. Using Azure as the managed Hadoop system would certainly be a
> first for the Kylin team, and you might be running into issues because of
> that.
>
> That said, we (the Kylin team) are interested in making your POC
> successful. As with any open source project, there is some assembly
> required. Is your team set up for development activities? If so, we can
> have team-to-team meetings to determine what it takes to make the POC
> successful.

Re: Need advice for kylin newbie

Posted by "Adunuthula, Seshu" <sa...@ebay.com>.
Vikram,

Thank you for the honest and direct feedback on Kylin. As you rightly
called out, the sweet spot for Kylin is the ability to do MOLAP on 10-100
billion rows with sub-second query responses. So we believe Kylin is the
right tool for your requirements below.

> So given these requirements, is Kylin the right solution to replace our
> on-premise MOLAP cubes?  As long as our users can pivot/slice & dice the
> measures quickly from client tools like Excel and Tableau by dragging and
> dropping dimensions into rows/columns w/o the need to join to fact table,

Docker is useful for single-machine developer deployments, and I have
found that a certain level of Docker expertise is needed before you can
successfully deploy them.

You are doing a certain set of firsts that could be making your setup a
nightmare. Using Azure as the managed Hadoop system would certainly be a
first for the Kylin team, and you might be running into issues because of
that.

That said, we (the Kylin team) are interested in making your POC
successful. As with any open source project, there is some assembly
required. Is your team set up for development activities? If so, we can
have team-to-team meetings to determine what it takes to make the POC
successful.
