You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Benjamin Jakobus <ja...@gmail.com> on 2013/08/27 08:44:25 UTC
Pig vs Hive: GROUP BY

Hi all,

I am trying to understand the difference between how Pig implements the
Group By operator and how Hive does it. My hypothesis is that Pig, being a
procedural and lazy language and hence creates a aliases for each "stage"
in the execution.


Background:
After benchmarking Hive and Pig, I found that the Group By operator in Pig
is drastically slower that Hive's. I was wondering whether anybody has
experienced the same? And whether people may have any tips for improving
the performance of this operation? (Adding a DISTINCT as suggested by an
earlier post on here doesn't help. I am currently re-running the benchmark
with LZO compression enabled).

For the given Group By, Pig launches 99 mappers, Hive only 8.

I have been running 3 different types of queries.

The first was performed on datasets of 6 different sizes:


   - Dataset size 1: 30,000 records (772KB)
   - Dataset size 2: 300,000 records (6.4MB)
   - Dataset size 3: 3,000,000 records (63MB)
   - Dataset size 4: 30 million records (628MB)
   - Dataset size 5: 300 million records (6.2GB)
   - Dataset size 6: 3 billion records (62GB)

The datasets scale linearly, whereby the size equates to 3000 * 10n .
A seventh dataset consisting of 1,000 records (23KB) was produced to
perform join
operations on. Its schema is as follows:
name - string
marks - integer
gpa - float
The data was generated using the generate data.pl perl script available for
download
 from https://issues.apache.org/jira/browse/PIG-200 to produce the
datasets. The results are as follows:


 *      * *      * *      * *Set 1      * *Set 2**      * *Set 3**      * *Set
4**      * *Set 5**      * *Set 6*
*Arithmetic**      * 32.82*      * 36.21*      * 49.49*      * 83.25*      *
 423.63*      * 3900.78
*Filter 10%**      * 32.94*      * 34.32*      * 44.56*      * 66.68*      *
 295.59*      * 2640.52
*Filter 90%**      * 33.93*      * 32.55*      * 37.86*      * 53.22*      *
 197.36*      * 1657.37
*Group**      * *      *49.43*      * 53.34*      * 69.84*      * 105.12*
   *497.61*      * 4394.21
*Join**      * *      *   49.89*      * 50.08*      * 78.55*      * 150.39*
   *1045.34*     *10258.19
*Averaged performance of arithmetic, join, group, order, distinct select
and filter operations on six datasets using Pig. Scripts were configured as
to use 8 reduce and 11 map tasks.*



 *      * *              Set 1**      * *Set 2**      * *Set 3**      * *Set
4**      * *Set 5**      * *Set 6*
*Arithmetic**      *  32.84*      * 37.33*      * 72.55*      * 300.08
 2633.72    27821.19
*Filter 10%      *   32.36*      * 53.28*      * 59.22*      * 209.5*    *
1672.3*     *18222.19
*Filter 90%      *  31.23*      * 32.68*      *  36.8*      *  69.55*      *
331.88*     *3320.59
*Group      * *      * 48.27*      * 47.68*      * 46.87*      * 53.66*
 *141.36*     *1233.4
*Join      * *      * *   *48.54*      *56.86*      * 104.6*      * 517.5*
   * 4388.34*      * -
*Distinct**      * *     *48.73*      *53.28*      * 72.54*      * 109.77*
   * - *      * *      *  -
*Averaged performance of arithmetic, join, group, distinct select and
filter operations on six datasets using Hive. Scripts were configured as to
use 8 reduce and 11 map tasks.*

(If you want to see the standard deviation, let me know).

So, to summarize the results: Pig outperforms Hive, with the exception of
using *Group By*.

The Pig scripts used for this benchmark are as follows:
*Arithmetic*
-- Generate with basic arithmetic
A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age,
gpa) PARALLEL $reducers;
B = foreach A generate age * gpa + 3, age/gpa - 1.5 PARALLEL $reducers;
store B into '$output/dataset_300000000_projection' using PigStorage()
PARALLEL $reducers;

*
*
*Filter 10%*
-- Filter that removes 10% of data
A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age,
gpa) PARALLEL $reducers;
B = filter A by gpa < '3.6' PARALLEL $reducers;
store B into '$output/dataset_300000000_filter_10' using PigStorage()
PARALLEL $reducers;


*Filter 90%*
-- Filter that removes 90% of data
A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age,
gpa) PARALLEL $reducers;
B = filter A by age < '25' PARALLEL $reducers;
store B into '$output/dataset_300000000_filter_90' using PigStorage()
PARALLEL $reducers;

*
*
*Group*
A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age,
gpa) PARALLEL $reducers;
B = group A by name PARALLEL $reducers;
C = foreach B generate flatten(group), COUNT(A.age) PARALLEL $reducers;
store C into '$output/dataset_300000000_group' using PigStorage() PARALLEL
$reducers;
*
*
*Join*
A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age,
gpa) PARALLEL $reducers;
B = load '$input/dataset_join' using PigStorage('\t') as (name, age, gpa)
PARALLEL $reducers;
C = cogroup A by name inner, B by name inner PARALLEL $reducers;
D = foreach C generate flatten(A), flatten(B) PARALLEL $reducers;
store D into '$output/dataset_300000000_cogroup_big' using PigStorage()
PARALLEL $reducers;

Similarly, here the Hive scripts:
*Arithmetic*
SELECT (dataset.age * dataset.gpa + 3) AS F1, (dataset.age/dataset.gpa -
1.5) AS F2
FROM dataset
WHERE dataset.gpa > 0;

*Filter 10%*
SELECT *
FROM dataset
WHERE dataset.gpa < 3.6;

*Filter 90%*
SELECT *
FROM dataset
WHERE dataset.age < 25;

*Group*
SELECT COUNT(dataset.age)
FROM dataset
GROUP BY dataset.name;

*Join*
SELECT *
FROM dataset JOIN dataset_join
ON dataset.name = dataset_join.name;

I will re-run the benchmarks to see whether it is the reduce or map side
that is slower and get back to you later today.

The other two benchmarks were slightly different: I performed transitive
self joins in which Pig outperformed Hive. However once I added a Group By,
Hive began outperforming Pig.

I also ran the TPC-H benchmarks and noticed that Hive (surprisingly)
outperformed Pig. However what *seems* to cause the actual performance
difference is the heavy usage of the Group By operator in all but 3 TPC-H
test scripts.

Re-running the scripts whilst omitting the the grouping of data produces
the expected results. For example, running script 3
(q3_shipping_priority.pig) whilst omitting the Group By operator
significantly reduces the runtime (to 1278.49 seconds real time runtime or
a total of 12,257,630ms CPU time).

The fact that the Group By operator skews the TPC-H benchmark in favour of
Apache Hive is supported by further experiments: as noted earlier a
benchmark was carried out on a transitive self-join. The former took Pig an
average of 45.36 seconds (real time runtime) to execute; it took Hive 56.73
seconds. The latter took  Pig 157.97 and Hive 180.19 seconds (again, on
average). However adding the Group By operator to the scripts turned the
tides: Pig is now significantly slower than Hive, requiring an average of
278.15 seconds. Hive on the other hand required only 204.01 to perform the
JOIN and GROUP operations.

Real time runtime is measured using the time -p command.

Disabling the combiner made things even slower.

Regards,
Ben