Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2007/11/08 20:03:57 UTC

[Pig Wiki] Update of "PigPerformance" by OlgaN

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigPerformance

New page:
== Pig Performance ==

This document publishes current (as of 11/07/07) performance numbers for Pig. The objective is to establish baseline numbers to compare against before we start making major changes to the system.

In addition, the document publishes numbers for Hadoop programs that perform identical computations. One of the objectives of Pig is to provide performance as close as possible (within 10-20%) to native Hadoop code. These numbers allow us to establish how wide the current gap is.

=== Test Setup ===

Both the Pig and Hadoop tests ran on an 11-machine cluster, with 1 machine dedicated to the `Name Node` and `Job Tracker` and 10 compute nodes. Each machine had the following hardware configuration:

   * 2 dual-core Intel(R) Xeon(R) CPUs @ 2.13GHz
   * 4 GB of memory

The cluster had Hadoop 0.14.1 installed with the following configuration:

   * 1024 MB of memory
   * 2 map and 2 reduce tasks per node

=== Test Data ===

Two data sets were used. Both contained tab-delimited, auto-generated data with an identical schema:

 * name - string
 * age - integer
 * gpa - float

The first dataset (`studenttab200M`) contained 200 million rows (4384624709 bytes) and was used for all tests. The second dataset (`studenttab10k`) contained 10 thousand rows (219190 bytes) and was used as the second input in the cogroup/join test.

The data can be generated using [ADD TOOL]
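
A generated dataset can be sanity-checked against the row counts listed above with a small Pig script. The sketch below is not one of the measured tests, and the output path is just a placeholder:

{{{
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = group A all;
C = foreach B generate COUNT(A.age);
store C into 'studenttab200M_rowcount' using PigStorage();
}}}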

=== Test Cases ===

==== Load and Store ====
{{{
A = load 'studenttab200M' using PigStorage('\t');
store A into 'my_studenttab200M' using PigStorage();
}}}

==== Filter that removes 10% of data ====
{{{
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = filter A by gpa < '3.6';
store B into 'my_studenttab200M_filter_10' using PigStorage();
}}}

==== Filter that removes 90% of data ====
{{{
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = filter A by age < '25';
store B into 'my_studenttab200M_filter_90' using PigStorage();
}}}

==== Generate with basic arithmetic ====
{{{
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = foreach A generate age * gpa + 3, age/gpa - 1.5;
store B into 'my_studenttab200M_projection' using PigStorage();
}}}
==== Grouping ====
{{{
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = group A by name;
C = foreach B generate flatten(group), COUNT(A.age);
store C into 'my_studenttab200M_group' using PigStorage();
}}}
==== Cogrouping/Join ====

{{{
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = load 'studenttab10k' using PigStorage('\t') as (name, age, gpa);
C = cogroup A by name inner, B by name inner;
D = foreach C generate flatten(A), flatten(B);
store D into 'my_studenttab200M_cogroup_small' using PigStorage();
}}}

=== Performance Numbers ===

The wall-clock time for each Pig script and Hadoop job was measured by using the `time` command on the client. In addition, average CPU and memory utilization for both the map and reduce stages was measured by observing `top`. Each test was run 3 times and the performance numbers averaged.

|| Test || Output Size (bytes) || Pig Time (s) || Pig Map CPU (%) || Pig Map Memory (MB) || Pig Reduce CPU (%) || Pig Reduce Memory (MB) || Hadoop Time (s) || Hadoop Map CPU (%) || Hadoop Map Memory (MB) || Hadoop Reduce CPU (%) || Hadoop Reduce Memory (MB) ||
|| !LoadStore || 4384624709 || 185 || 99 || 40 || N/A || N/A || 103 || 95 || 40 || N/A || N/A ||
|| Filter 10% || 3940641765 || 207 || 99 || 40 || N/A || N/A || 137 || 95 || 40 || N/A || N/A ||
|| Filter 90% || 511609335 || 170 || 99 || 40 || N/A || N/A || 100 || 95 || 40 || N/A || N/A ||
|| Project || ?? || 427 || 99 || 150 || N/A || N/A || 127 || 95 || 40 || N/A || N/A ||
|| Grouping || 14144 || 1072 || 99 || 400 || 99 || 750 || 180 || ?? || ?? || ?? || ?? ||
|| Cogroup || 129701186044 || 2480 || 99 || 450 || 50-90 || 500 || ?? || ?? || ?? || ?? || ?? ||

=== Additional Work ===

We also need to do the following:

   * Add more tests (sketched after this list) for:
      * ORDER BY where data fits in memory
      * ORDER BY where data does not fit into memory
      * DISTINCT where data fits in memory
      * DISTINCT where data does not fit into memory
      * JOIN between 2 large tables
      * UDF overhead
   * Perform code profiling
      * by using tools
      * through code inspection
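
As a starting point for the additional tests listed above, here are rough Pig Latin sketches written in the same style as the test cases on this page. The second large dataset, the UDF, the jar, and the output paths are hypothetical placeholders; the "fits in memory" vs. "does not fit in memory" variants would simply swap `studenttab10k` for `studenttab200M`.

Order by:
{{{
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = order A by name;
store B into 'studenttab200M_order' using PigStorage();
}}}

Distinct:
{{{
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = distinct A;
store B into 'studenttab200M_distinct' using PigStorage();
}}}

Join between 2 large tables (assumes a second 200-million-row dataset with the same schema):
{{{
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = load 'studenttab200M_2' using PigStorage('\t') as (name, age, gpa);
C = cogroup A by name inner, B by name inner;
D = foreach C generate flatten(A), flatten(B);
store D into 'studenttab200M_join_large' using PigStorage();
}}}

UDF overhead (assumes a trivial pass-through UDF, here called `myudfs.Identity`, packaged in a jar; comparing against the plain projection test would isolate the per-call cost):
{{{
register myudfs.jar;
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = foreach A generate myudfs.Identity(name), age, gpa;
store B into 'studenttab200M_udf' using PigStorage();
}}}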