You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Bryan Dennstedt <br...@phingers.com> on 2010/07/22 21:01:19 UTC

complex hive query thoughts.

So I'm new to hadoop and hive, but I have a 4 node cluster up and running,
the sample apps work great on it. But when I run my query, it doesn't seem
to actually "distribute" the compute to all the nodes.

This is one small portion of a larger query, but this seems to be the first
bottleneck in the larger one.

I have a table with data in it, has around 900k records in it.

I need to compare those records against themselves, and get all the records
where we find the min value between them. Or basically it generates 810m
records.

When I run this query on hive, it processes the map very quickly, and then
starts a reduce, but the reduce is only running on one node.

select c1.id as c1id,
       c2.id as c2id,
       c1.type,
       min(c1.value - c2.value) as c
from data_table c1
join data_table c2
on (
c1.year = c2.year and
c1.month = c2.month and
 c1.type = c2.type)
where c1.type = 1
group by c1.id,
 c2.id,
 c1.type

It quickly gets to 67% and then takes about 5 minutes or more per
percentage, I only let it get to about 70% before I kill it since it is at
that point slower than other options.

Is this a configuration issue? Not enough memory, disk space? Do I need more
nodes? Do I need to break it down into smaller chunks I guess first? Should
I not use hive and some other method? Hive/Sqoop worked beautifully to
import and do simple queries so far, albeit slower than mysql, so now we get
down to the heart of the matter.

I appreciate any wild guesses, or detailed thoughts. :) I'll try anything.

I have the same query running in JAVA memory alone completing in about 10
minutes, in PHP in 35 minutes.

I think hadoop/hive should be able to do it quicker than all of them, but
again since I'm so new, I don't know the right way to structure the query
properly is my guess.

Thanks!

-Bryan