Posted to user@hive.apache.org by 老赵 <la...@sina.cn> on 2014/12/08 06:12:17 UTC

hive sql tune

Hello,

I work for a telecommunication service provider, so I can access the page-view logs of users from a specific area. Now I want to query the top 1000 sites by PV (page views).

I wrote a UDF named parse_top_domain to get the top-level domain of a host, e.g. www1.google.com.hk -> google.com.hk, and I use the HQL below:

add jar hive_func.jar;
create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';
select parse_top_domain(parse_url(url,'HOST')), count(*) c
from src_click
where date = 20141204
and parse_top_domain(parse_url(url,'HOST')) != ''
group by parse_top_domain(parse_url(url,'HOST'))
order by c desc;

The data format of src_click is org.apache.hadoop.mapred.TextInputFormat, and the files are compressed. This HQL generates 8 mappers and 1 reducer; since the data is very big, it is very slow. I hoped it would generate many more mappers, so I set:

set mapred.map.tasks=100;

But this had no effect. Can anyone help me or give some suggestions? Any reply is appreciated.

--------------------------------

ZHAO

laozhao0@sina.cn
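As an aside, the query as posted never applies the top-1000 cut and evaluates the UDF in three separate clauses. A minimal sketch of a tightened version (assuming the same hypothetical UDF and table names from the post; untested):

```sql
add jar hive_func.jar;
create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';

-- Compute the top-level domain once in a subquery, then aggregate and keep
-- only the requested top 1000 rows.
select domain, count(*) c
from (
  select parse_top_domain(parse_url(url, 'HOST')) as domain
  from src_click
  where date = 20141204
) t
where domain != ''
group by domain
order by c desc
limit 1000;
```

This does not change the single final reducer: a global "order by" in Hive inherently funnels through one reducer, but "limit" caps how much it must emit.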

Re: hive sql tune

Posted by Stéphane Verlet <ka...@gmail.com>.
You can reduce mapreduce.input.fileinputformat.split.maxsize  to increase
the number of mappers (more splits).
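For example (the value is illustrative; anything smaller than the current effective split size raises the mapper count):

```sql
-- Illustrative: cap each input split at 64 MB so the same input
-- is divided into more splits, and therefore more map tasks.
set mapreduce.input.fileinputformat.split.maxsize=67108864;
```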

However, your issue is likely due, as David alluded, to the compression.
Depending on how your files are organized and compressed, Hadoop may not be
able to split them to feed several mappers:
https://cwiki.apache.org/confluence/display/Hive/CompressedStorage
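If the files turn out to use a non-splittable codec such as plain Gzip, one workaround along the lines of that wiki page is to rewrite the data as a block-compressed SequenceFile, which is splittable (the table name src_click_seq is illustrative; untested sketch):

```sql
set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK;

-- Copy the day's data into a splittable, block-compressed SequenceFile table.
create table src_click_seq stored as sequencefile as
select * from src_click where date = 20141204;
```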

Stephane

On Sun, Dec 7, 2014 at 10:50 PM, david1990111@163.com <da...@163.com>
wrote:

> You mentioned that 'The dataformat in src_click is org.apache.hadoop.mapred.TextInputFormat
> and has been compressed.'
> May I ask which compression codec you use?
>

Re: hive sql tune

Posted by "david1990111@163.com" <da...@163.com>.
You mentioned that 'The dataformat in src_click is org.apache.hadoop.mapred.TextInputFormat and has been compressed.'
May I ask which compression codec you use?
 
From: 老赵
Sent: 2014-12-08 13:12
To: user
Subject: hive sql tune