Posted to dev@drill.apache.org by 子落 <ya...@taobao.com> on 2013/08/12 11:25:46 UTC
Introduce mdrill project (open source, may be helpful for Apache Drill's development)
Its address is https://github.com/alibaba/mdrill . I think some of the
information or design may be helpful for Apache Drill development.
It is similar to Apache Drill or Google PowerDrill, and it is based on
hadoop, lucene, solr, and jstorm.
My project currently has 10 tables, 47760506482 rows, and 80~400 columns
(running on 10 machines, each with 48 GB of RAM and 12 x 2 TB disks).
Some example queries are shown below:
select count(*) from r_rpt_cps_luna_item where thedate >='20130416' and
thedate <'20130811' limit 0,100
_____
totalRecords:1
count(*)
11108914892
time taken 4.031 seconds
select sum(landing_uv) from r_rpt_cps_luna_item where thedate >='20130416'
and thedate <'20130811' limit 0,100
_____
totalRecords:1
sum(landing_uv)
2.07678497E8
time taken 56.081 seconds
select dist(user_id) from r_rpt_cps_luna_item where thedate >='20130416' and
thedate <'20130811' limit 0,100
_____
totalRecords:1
dist(user_id)
1483008.0
time taken 246.147 seconds
select thedate,count(*) as cnt from r_rpt_cps_luna_item where thedate
>='20130416' and thedate <'20130811' group by thedate order by cnt desc
limit 0,3
_____
totalRecords:118
thedate     cnt
20130803    158301304
20130802    157748487
20130725    157047045
time taken 34.727 seconds
select thedate,user_id,count(*) as cnt from r_rpt_cps_luna_item where
thedate >='20130416' and thedate <'20130811' group by thedate,user_id order
by cnt desc limit 0,3
_____
totalRecords:10010
thedate     user_id      cnt
20130725    725677994    194397
20130725    101450072    192650
20130701    101450072    189107
time taken 149.316 seconds
select thedate,category_level1,count(*) as cnt from r_rpt_cps_luna_item
where thedate >='20130416' and thedate <'20130811' group by
thedate,category_level1 order by cnt desc limit 0,3
_____
totalRecords:10010
thedate     category_level1    cnt
20130803    16                 26487658
20130802    16                 26306163
20130725    16                 26128576
time taken 94.989 seconds
select thedate,category_level1,category_level2,count(*) as cnt from
r_rpt_cps_luna_item where thedate >='20130416' and thedate <'20130811'
group by thedate,category_level1,category_level2 order by cnt desc limit 0,3
_____
totalRecords:10010
thedate     category_level1    category_level2    cnt
20130725    16                 50010850           7315606
20130803    16                 50010850           7006255
20130802    16                 50010850           6936059
time taken 288.885 seconds
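To put the timings above in perspective, here is a rough back-of-the-envelope calculation of the effective row rate for the three single-value queries. It is only a sketch: it assumes every query covers all 11108914892 rows matched by the date filter and that the work is spread evenly over the 10 machines, neither of which is stated in this message. The helper name rows_per_second is made up for illustration, and a count(*) answered from an index may not scan rows at all.

```python
# Figures copied from the query output in this message; nothing is measured anew.
ROWS_MATCHED = 11_108_914_892  # result of count(*) over the date range
MACHINES = 10                  # cluster size stated in the post

# Elapsed seconds reported for the three single-value queries.
timings = {
    "count(*)": 4.031,
    "sum(landing_uv)": 56.081,
    "dist(user_id)": 246.147,
}


def rows_per_second(rows: int, seconds: float) -> float:
    """Effective rate: rows covered divided by elapsed wall-clock time."""
    return rows / seconds


for name, secs in timings.items():
    total = rows_per_second(ROWS_MATCHED, secs)
    # Assumption: even spread over all machines (not confirmed by the post).
    per_machine = total / MACHINES
    print(f"{name}: {total / 1e6:.0f}M rows/s total, "
          f"{per_machine / 1e6:.0f}M rows/s per machine")
```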
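Introduction
1: mdrill aims to help users analyze data at the ten-billion-row scale,
across any combination of dimensions, within seconds to tens of seconds.
2: mdrill is a distributed online analytical query system, built on
open-source systems such as hadoop, lucene, solr, and jstorm, with a
SQL-based query syntax. mdrill is a software framework for distributed
processing of large volumes of data. mdrill is fast: its lower layers use
indexes, columnar storage, and in-memory caching, which greatly increases
data scan speed. mdrill is distributed: it works in parallel and speeds up
processing through parallel execution.
3: The adhoc project, an application built on mdrill, uses 10 machines and
stores 40 billion rows.
==> Each query scans 3 billion rows, with response times of roughly 20 to
120 seconds (depending on the query conditions and the number of columns
scanned).
==> count(*) over 10 billion rows takes 2 seconds; a single-column sum takes
25 seconds; count and sum grouped by date take 47 seconds; grouping by user
id and sorting by number of transactions to get the TopN takes 243 seconds.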
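The "works in parallel" point in item 2 can be illustrated with a generic scatter/gather group-by count, similar in shape to the "group by thedate order by cnt desc" query above. This is not mdrill's code: the shard layout, the function names partial_count and grouped_top_n, and the use of Python threads are all assumptions chosen to keep the sketch small.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_count(shard):
    """Count rows per 'thedate' value within one shard."""
    return Counter(row["thedate"] for row in shard)


def grouped_top_n(shards, n):
    """Count each shard in parallel, merge the counts, return the n largest groups."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_count, shards))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds counts for keys seen in several shards
    return merged.most_common(n)


# Hypothetical toy data: two shards holding a few rows each.
shards = [
    [{"thedate": "20130802"}, {"thedate": "20130803"}],
    [{"thedate": "20130803"}, {"thedate": "20130725"}, {"thedate": "20130803"}],
]
print(grouped_top_n(shards, 1))  # [('20130803', 3)]
```

Merging the full per-shard counts before taking the top N keeps the result exact; sending only each shard's local top N would be cheaper but can miss groups that are spread thinly across shards.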
Re: Introduce mdrill project (open source, may be helpful for Apache Drill's development)
Posted by Jacques Nadeau <ja...@apache.org>.
Interesting. Is there any other English documentation about its purpose
and architecture?