Posted to dev@hive.apache.org by Alexander Alexandrov <al...@gmail.com> on 2011/07/23 03:25:43 UTC

Hive Optimization Process

Hello Hive developers,

I study informatics at TU Berlin. As part of my master's thesis, I conducted
some experiments on Hive (version 0.7.0) a couple of months ago.
Unfortunately, I don't have access to some critical compile-time data, so I
was hoping you could help me make sense of some numbers by filling in the
gaps or pointing me to sources describing how the Hive optimizer works.

For all generated datasets, I used the plain TEXTFILE storage format without
table partitioning (that is, my data was stored in HDFS as large logical
CSV-like files). I defined a set of minimalistic queries (up to 4 logical
operators working on 1-2 input relations), organized into distinct subsets
depending on the nature of the performed operations (e.g. embarrassingly
parallel queries, parallel aggregation queries, parallel joins, etc.). For
all queries, I ran experiments with an increasing selectivity factor to see
how the system reacts to a growing amount of processed data.
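
For reference, the tables I used look roughly like the following (a minimal
sketch; the column names and types are placeholders standing in for my
actual schemas):

CREATE TABLE A (
  x INT,
  y DOUBLE,
  z DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;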

I will try to illustrate my questions with short abstract examples. Consider
the following general query with a range-based filter and an additive
aggregation function:

SELECT   x, AVG(y)
FROM     A
WHERE    z < {alpha}
GROUP BY x;

The group key column x has a relatively small number of distinct values,
i.e. there are few reduce groups, each with a high number of records. I used
increasing {alpha} values to vary the selectivity of A from 0.4 to 1.0, but
I don't see a corresponding increase in the number of output records emitted
by the mappers. *Since the value of the "COMBINE_OUTPUT_RECORDS" counter is
also zero for all jobs, I assume that you somehow push the combine logic
inside the Hive sub-plan executed by the mapper - is that correct?*
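
If my assumption is right, disabling the map-side hash aggregation should
make the map output grow with {alpha} again. This is how I would verify it,
assuming hive.map.aggr is the switch that controls this behavior (please
correct me if I am wrong):

-- with map-side aggregation (the default): one partial aggregate per group
-- should leave the mapper, regardless of {alpha}
set hive.map.aggr=true;
SELECT x, AVG(y) FROM A WHERE z < {alpha} GROUP BY x;

-- without it: every filtered row should be shuffled, so the map output
-- record count should now scale with {alpha}
set hive.map.aggr=false;
SELECT x, AVG(y) FROM A WHERE z < {alpha} GROUP BY x;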

I make similar observations if I replace the additive aggregation function
(AVG) with a built-in "holistic" one (PERCENTILE):

set hive.map.aggr.hash.percentmemory=0.05;
set mapred.child.java.opts=-Xmx24576m;

SELECT   x, PERCENTILE(y, array(0.25, 0.50, 0.75))
FROM     A
WHERE    z < {alpha}
GROUP BY x;

Again, the log stats suggest that there is no Hadoop combiner (which I can
understand in this case), although the number of map output records is the
same as in the previous case. This suggests that somehow (maybe because
there are only 10 distinct groups) the aggregated y-values can be compacted
into a single record per group inside the mapper - I assume this is done by
the hash-table optimization. *Can someone briefly explain when and how
exactly the hash optimization works for the PERCENTILE function?*
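
For what it's worth, here is how I tried to check this myself: if I read
the EXPLAIN output correctly, a Group By Operator with mode "hash" inside
the map-side plan would indicate that the partial PERCENTILE aggregates are
kept in the mapper's hash table (my interpretation; I may be misreading the
operator tree):

EXPLAIN
SELECT   x, PERCENTILE(y, array(0.25, 0.50, 0.75))
FROM     A
WHERE    z < {alpha}
GROUP BY x;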

I'd be very grateful for any kind of help!

Regards,
Alexander Alexandrov