Posted to user@hive.apache.org by Awhan Patnaik <aw...@spotzot.com> on 2015/12/17 08:25:41 UTC

increase number of reducers

3 node cluster with 15 GB of RAM per node. Two tables: L has approximately
1 million rows, U has 100 million. Both have latitude and longitude
columns. I want to find, for each row in L, the count of rows in U that
lie within a 10 mile radius of it.

I have indexed the latitude and longitude columns in U. U is partitioned
by date. Both U and L are stored as ORC with Snappy compression.

My query is like this:

select l.id, count(u.id) from L l, U u
where
u.lat != 0 and
u.lat > l.lat - 10/69 and u.lat < l.lat + 10/69 and
u.lon > l.lon - ( 10 / ( 69 * cos(radians(l.lat)) ) ) and
u.lon < l.lon + ( 10 / ( 69 * cos(radians(l.lat)) ) ) and
3960 * acos(cos(radians(l.lat)) * cos(radians(u.lat)) * cos(radians(l.lon)
- radians(u.lon)) + sin(radians(l.lat)) * sin(radians(u.lat))) < 10.0
group by l.id;

The conditions in the where clause enforce a coarse bounding-box
pre-filter on the lat/lon values, followed by the exact great-circle
distance check in the last predicate.
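The arithmetic behind those predicates can be checked outside Hive. Below is a minimal Python sketch (not part of the original thread) of the same formulas, using the query's constants: 69 miles per degree of latitude and an Earth radius of 3960 miles. The sample coordinates are made up for illustration.

```python
import math

EARTH_RADIUS_MILES = 3960.0   # radius constant used in the query
MILES_PER_DEG_LAT = 69.0      # approximate miles per degree of latitude

def bounding_box(lat, lon, radius_miles=10.0):
    """Coarse pre-filter box: (lat_min, lat_max, lon_min, lon_max)."""
    dlat = radius_miles / MILES_PER_DEG_LAT
    # Longitude degrees shrink with latitude, hence the cos() correction.
    dlon = radius_miles / (MILES_PER_DEG_LAT * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Spherical law of cosines, matching the 3960 * acos(...) expression."""
    la1, lo1 = math.radians(lat1), math.radians(lon1)
    la2, lo2 = math.radians(lat2), math.radians(lon2)
    c = (math.cos(la1) * math.cos(la2) * math.cos(lo1 - lo2)
         + math.sin(la1) * math.sin(la2))
    # Clamp to guard against floating-point values a hair above 1.0.
    return EARTH_RADIUS_MILES * math.acos(min(1.0, c))

# A point ~7 miles due east should pass both the box and the exact test.
lat_min, lat_max, lon_min, lon_max = bounding_box(40.0, -74.0)
print(lon_min < -73.87 < lon_max)                       # inside the box
print(great_circle_miles(40.0, -74.0, 40.0, -73.87))    # roughly 6.9 miles
```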

The problem is that this results in 9 mappers but only 1 reducer. The job
gets stuck at 67% of the reduce phase. When I run htop I find that 2 of
the nodes sit idle while the third node is busy running the single reduce
task.

I tried "set mapreduce.job.reduces=50;" but that did not help: the number
of reduce tasks was still deduced to be 1 at compile time.
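For reference (not from the original thread), Hive also derives its compile-time reducer estimate from data-size settings, so a sketch of the other knobs commonly tried looks like this; the values below are illustrative only, and whether they help depends on the plan Hive actually chooses:

```sql
-- Hive estimates reducers from input size per reducer; shrinking the
-- per-reducer byte budget tends to raise the estimated reducer count.
set hive.exec.reducers.bytes.per.reducer=67108864;  -- ~64 MB per reducer
set hive.exec.reducers.max=50;                      -- cap on the estimate
```

Note also that a group by on l.id can still funnel work to few reducers if a handful of keys dominate.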

How do I force more reducers?

Re: increase number of reducers

Posted by Muni Chada <mc...@pivotal.io>.
Is this table bucketed? If so, please set the number of reducers to match
the table's bucket count (set mapreduce.job.reduces=bucket_count).
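A minimal sketch of that suggestion in HiveQL (not from the original thread; the table layout and the choice of 16 buckets are hypothetical):

```sql
-- Hypothetical bucketed layout for the large table U.
create table u_bucketed (id bigint, lat double, lon double)
clustered by (id) into 16 buckets
stored as orc;

set hive.enforce.bucketing=true;   -- route rows to buckets on insert
set mapreduce.job.reduces=16;      -- match reducers to the bucket count
```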
