Posted to dev@kylin.apache.org by 李刚 <li...@58.com> on 2015/08/12 08:19:16 UTC

Kylin performance question

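Hello. Have you tested Kylin's performance? We have 30 million records per day that need to be aggregated into a cube for front-end queries. The build should not take long, roughly within 3 hours. Can Kylin handle this? What do your own measured results look like?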

Re: kylin sample problem

Posted by Li Yang <li...@apache.org>.

The Kylin documentation is at http://kylin.incubator.apache.org/docs/

Most people start there.

If you are looking for a walkthrough that goes from a Hive data model to
creating a cube, then no, there is no such document yet, but we are
working on one. It should be out in one or two months.

On Fri, Aug 21, 2015 at 3:43 PM, hongbin ma <ma...@apache.org> wrote:

> Hi Gang,
>
> What do you mean by "sample doc"? An explanation of the sample cube?
>
> On Fri, Aug 21, 2015 at 1:36 PM, 李刚 <li...@58.com> wrote:
>
> > There is no Kylin sample documentation. I don't know how to use Hive
> > with it. Does it use HBase to store the data? I hope someone can
> > provide some explanation.
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
>

Re: kylin sample problem

Posted by hongbin ma <ma...@apache.org>.

Hi Gang,

What do you mean by "sample doc"? An explanation of the sample cube?

On Fri, Aug 21, 2015 at 1:36 PM, 李刚 <li...@58.com> wrote:

> There is no Kylin sample documentation. I don't know how to use Hive
> with it. Does it use HBase to store the data? I hope someone can
> provide some explanation.


-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone

kylin sample problem

Posted by 李刚 <li...@58.com>.

There is no Kylin sample documentation. I don't know how to use Hive with it. Does it use HBase to store the data? I hope someone can provide some explanation.

Re: Kylin performance question

Posted by "Shi, Shaofeng" <sh...@ebay.com>.

Hi Gang,

Kylin leverages Hadoop to build cubes and HBase to serve runtime queries.
Most of the computation happens in the Hadoop and HBase clusters, both of
which are scalable, so performance very likely depends on the cluster
size. Other factors also affect performance, such as cube complexity and
dimension cardinality. We cannot give a simple yes or no answer to your
question without detailed inputs.

Liang Meng’s case is a very good one: he listed the cluster size, cube
dimensions, data cardinality, etc. Thanks Meng, this is a good reference
for all users. (Next time, if you can write it in English, that would be
great.)

Please allow me to translate it:

30 million records per day is a small case. Let me give you our case:
50 nodes;
6 billion records per day;
5 lookup tables, 8 dimensions;
One of the dimensions has more than 10 million distinct values
(cardinality); the other dimensions’ cardinalities range from tens of
thousands to hundreds of thousands;
Building one day’s data takes about 200 minutes.

Most of the time is spent on:
1. Extracting data from Hive (creating the flat intermediate table): we
restrict Hive to at most 10% of the cluster's capacity, so this step is
slow and usually takes about 1 hour;
2. Creating the HFiles, which takes a little over 1 hour;
The other steps take about 1 hour.
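
As a rough sanity check on these figures, the short Python sketch below
turns them into a throughput number and applies it to the 30 million
records per day from the original question. It assumes build time scales
linearly with record count on the same cluster, which is only an
approximation: it ignores fixed job start-up overhead, cube design and
dimension cardinality. The function and variable names are illustrative
and are not part of Kylin.

```python
# Back-of-envelope throughput from the figures reported in this thread.
RECORDS_PER_DAY = 6_000_000_000  # 6 billion records per day
BUILD_MINUTES = 200              # reported build time for one day's data
NODES = 50                       # reported cluster size


def records_per_second(records: int, minutes: float) -> float:
    """Average records processed per second over the whole build."""
    return records / (minutes * 60)


def estimated_minutes(records: int, rate_per_second: float) -> float:
    """Naive linear estimate of the build time at a given rate."""
    return records / rate_per_second / 60


cluster_rate = records_per_second(RECORDS_PER_DAY, BUILD_MINUTES)
per_node_rate = cluster_rate / NODES
# Assumption: the same rate applies to a much smaller daily volume.
small_case_minutes = estimated_minutes(30_000_000, cluster_rate)

print(f"cluster: {cluster_rate:,.0f} records/s")
print(f"per node: {per_node_rate:,.0f} records/s")
print(f"30M records at the same rate: {small_case_minutes:.1f} min")
```

Under these assumptions the reported cluster averages about 500,000
records per second (about 10,000 per node), so the raw record volume of
the 30 million case would not be the limiting factor. Fixed per-job
overhead and the size of the cluster actually available would probably
dominate instead.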


On 8/12/15, 3:18 PM, "liangmeng" <13...@139.com> wrote:

>30 million is too small a case. Let me give you one of our cases:
>50 nodes
>6 billion records per day
>5 dimension tables, 8 dimensions
>One dimension has a cardinality in the tens of millions; the others range
>from tens of thousands to hundreds of thousands
>Building one day's data takes about 200 minutes.
>
>Most of the time is spent on:
>1. Extracting data from the Hive table. Because we restrict Hive to 10% of
>the whole cluster's resources, this step is relatively slow and takes
>about 1 hour;
>2. Generating the HBase HFiles for the cube at the end, which takes a
>little over 1 hour
>The remaining aggregation takes a little over 1 hour as well.

Re: Kylin performance question

Posted by hongbin ma <ma...@apache.org>.

Please try to communicate in English...

2015-08-12 15:18 GMT+08:00 liangmeng <13...@139.com>:

> 30 million is too small a case. Let me give you one of our cases:
> 50 nodes
> 6 billion records per day
> 5 dimension tables, 8 dimensions
> One dimension has a cardinality in the tens of millions; the others range
> from tens of thousands to hundreds of thousands
> Building one day's data takes about 200 minutes.
>
> Most of the time is spent on:
> 1. Extracting data from the Hive table. Because we restrict Hive to 10% of
> the whole cluster's resources, this step is relatively slow and takes
> about 1 hour;
> 2. Generating the HBase HFiles for the cube at the end, which takes a
> little over 1 hour
> The remaining aggregation takes a little over 1 hour as well.



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone

Re: Re: Kylin performance question

Posted by Luke Han <lu...@apache.org>.

Thanks Gang for raising this discussion; many people are interested in
this topic.
Thanks Meng for sharing your case and performance numbers.
Thanks Shaofeng for translating...

Wow, since we talk about diversity, maybe language options should be
considered too ;-)

Anyway, Meng's case is a really good reference for anyone who would like
to know Kylin's capability and performance.

To comment on the "time spent on" part:
1. A refactoring has already been done so that other SQL-on-Hadoop
engines, for example Spark SQL or Drill, can be plugged in as the input
source. There is also a ticket about enabling Hive on Spark. Once those
are done, the duration of the Hive query step should drop considerably.

2. HFile creation is one part that could potentially be improved. Please
comment if anyone has experience with this, including with the new
features of HBase v1.x.

Also, if the data source has streaming capability, Kylin Streaming is
almost ready; it runs micro-batches to aggregate data within several
minutes.

Thanks.

2015-08-12 16:48 GMT+08:00 liangmeng <13...@139.com>:

> We run an ETL job once per hour (extracting fields and doing a
> fine-grained GROUP BY); each run takes about ten minutes.
>

Re: Re: Kylin performance question

Posted by liangmeng <13...@139.com>.

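We run an ETL job once per hour (extracting fields and doing a fine-grained GROUP BY); each run takes about ten minutes.



Liang Meng (梁猛)
China Mobile Guangdong, Network Management and Maintenance Center, Network Management Support Office
Tel: 13802880779
Email: liangmeng@gd.chinamobile.com, 13802880779@139.com
Address: 3/F North, Guangdong GoTone (全球通) Building, 11 Zhujiang West Road, Zhujiang New Town, Guangzhou, Guangdong
Postal code: 510623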
 
From: 李刚
Sent: 2015-08-12 15:32
To: dev
Subject: Re: Kylin performance question
Don't you run any ETL on these 6 billion records? Do you load all 6 billion directly into Hive?

How do you schedule your daily runs? Do you develop a scheduler program that calls a script?

Re: Kylin performance question

Posted by 李刚 <li...@58.com>.

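Don't you run any ETL on these 6 billion records? Do you load all 6 billion directly into Hive?

How do you schedule your daily runs? Do you develop a scheduler program that calls a script?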


From: liangmeng
Sent: 2015-08-12 15:18
To: dev
Subject: Re: Kylin performance question
30 million is too small a case. Let me give you one of our cases:
50 nodes
6 billion records per day
5 dimension tables, 8 dimensions
One dimension has a cardinality in the tens of millions; the others range from tens of thousands to hundreds of thousands
Building one day's data takes about 200 minutes.

Most of the time is spent on:
1. Extracting data from the Hive table. Because we restrict Hive to 10% of the whole cluster's resources, this step is relatively slow and takes about 1 hour;
2. Generating the HBase HFiles for the cube at the end, which takes a little over 1 hour
The remaining aggregation takes a little over 1 hour as well.

Re: Kylin performance question

Posted by liangmeng <13...@139.com>.

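30 million is too small a case. Let me give you one of our cases:
50 nodes
6 billion records per day
5 dimension tables, 8 dimensions
One dimension has a cardinality in the tens of millions; the others range from tens of thousands to hundreds of thousands
Building one day's data takes about 200 minutes.

Most of the time is spent on:
1. Extracting data from the Hive table. Because we restrict Hive to 10% of the whole cluster's resources, this step is relatively slow and takes about 1 hour;
2. Generating the HBase HFiles for the cube at the end, which takes a little over 1 hour
The remaining aggregation takes a little over 1 hour as well.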



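Liang Meng (梁猛)
China Mobile Guangdong, Network Management and Maintenance Center, Network Management Support Office
Tel: 13802880779
Email: liangmeng@gd.chinamobile.com, 13802880779@139.com
Address: 3/F North, Guangdong GoTone (全球通) Building, 11 Zhujiang West Road, Zhujiang New Town, Guangzhou, Guangdong
Postal code: 510623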
 
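From: 李刚
Sent: 2015-08-12 14:19
To: dev
Subject: Kylin performance question

Hello. Have you tested Kylin's performance? We have 30 million records per day that need to be aggregated into a cube for front-end queries. The build should not take long, roughly within 3 hours. Can Kylin handle this? What do your own measured results look like?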