Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/16 20:22:15 UTC
[Pig Wiki] Update of "PigMix" by daijy

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "PigMix" page has been changed by daijy.
http://wiki.apache.org/pig/PigMix?action=diff&rev1=14&rev2=15

--------------------------------------------------

  PigMix is a set of queries used to test pig performance from release to release.  There are queries that test latency (how long does it
  take to run this query?), and queries that test scalability (how many fields or records can pig handle before it fails?).  In addition
  it includes a set of map reduce java programs to run equivalent map reduce jobs directly.  These will be used to test the performance
- gap between direct use of map reduce and using pig.
+ gap between direct use of map reduce and using pig. In June 2010 we released PigMix2, which adds 5 queries to the
+ original 12 PigMix queries to measure the performance of new Pig features. We will publish results for both PigMix and PigMix2.
  
  == Runs ==
+ === PigMix ===
  
  The following table includes runs done of the pig mix.  All of these runs have been done on a cluster with 26 slaves plus one machine acting as the name node and job tracker.  The cluster was running 
  hadoop version 0.18.1.  (TODO:  Need to get specific hardware info on those machines).  
@@ -140, +142 @@

  || Total     || 1407         || 1362.33       || 1.03       ||
  || Weighted Avg ||           ||               || 1.09       ||
  
+ === PigMix2 ===
+ Run date:  May 29, 2010, run against top of trunk as of that day.
+ || Test      || Pig run time || Java run time || Multiplier ||
+ || PigMix_1  || 122.33       || 117           || 1.05       ||
+ || PigMix_2  || 50.33        || 42.67         || 1.18       ||
+ || PigMix_3  || 189          || 100.33        || 1.88       ||
+ || PigMix_4  || 75.67        || 61            || 1.24       ||
+ || PigMix_5  || 64           || 138.67        || 0.46       ||
+ || PigMix_6  || 65.67        || 69.33         || 0.95       ||
+ || PigMix_7  || 88.33        || 84.33         || 1.05       ||
+ || PigMix_8  || 39           || 47.67         || 0.82       ||
+ || PigMix_9  || 274.33       || 215.33        || 1.27       ||
+ || PigMix_10 || 333.33       || 311.33        || 1.07       ||
+ || PigMix_11 || 151.33       || 157           || 0.96       ||
+ || PigMix_12 || 70.67        || 97.67         || 0.72       ||
+ || PigMix_13 || 80           || 33            || 2.42       ||
+ || PigMix_14 || 69           || 86.33         || 0.80       ||
+ || PigMix_15 || 80.33        || 69.33         || 1.16       ||
+ || PigMix_16 || 82.33        || 69.33         || 1.19       ||
+ || PigMix_17 || 286          || 229.33        || 1.25       ||
+ || Total     || 2121.67      || 1929.67       || 1.10       ||
+ || Weighted Avg ||           ||               || 1.14544    ||
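
The Multiplier column in the PigMix2 table is consistent with the Pig run time divided by the Java run time, and the 1.14544 figure in the Weighted Avg row matches the plain arithmetic mean of the 17 per-query ratios. The following is a minimal Python sketch that recomputes both from the table. The names `RUNS`, `multiplier` and `summarize` are illustrative and not part of PigMix, and reading "Weighted Avg" as an unweighted mean is an inference from the numbers, not something the page states.

```python
# (pig_run_time, java_run_time) per query, copied from the PigMix2 table.
# Units are whatever the table uses; the page does not state them here.
RUNS = {
    "PigMix_1": (122.33, 117.0),
    "PigMix_2": (50.33, 42.67),
    "PigMix_3": (189.0, 100.33),
    "PigMix_4": (75.67, 61.0),
    "PigMix_5": (64.0, 138.67),
    "PigMix_6": (65.67, 69.33),
    "PigMix_7": (88.33, 84.33),
    "PigMix_8": (39.0, 47.67),
    "PigMix_9": (274.33, 215.33),
    "PigMix_10": (333.33, 311.33),
    "PigMix_11": (151.33, 157.0),
    "PigMix_12": (70.67, 97.67),
    "PigMix_13": (80.0, 33.0),
    "PigMix_14": (69.0, 86.33),
    "PigMix_15": (80.33, 69.33),
    "PigMix_16": (82.33, 69.33),
    "PigMix_17": (286.0, 229.33),
}


def multiplier(pig_time, java_time):
    """Ratio of Pig run time to Java run time; below 1.0 means Pig was faster."""
    return pig_time / java_time


def summarize(runs):
    """Recompute the Total row and the mean of the per-query multipliers."""
    ratios = [multiplier(pig, java) for pig, java in runs.values()]
    total_pig = sum(pig for pig, _ in runs.values())
    total_java = sum(java for _, java in runs.values())
    return {
        "total_pig": total_pig,
        "total_java": total_java,
        "total_multiplier": total_pig / total_java,
        # Assumption: the "Weighted Avg" row is this unweighted mean.
        "mean_multiplier": sum(ratios) / len(ratios),
    }


if __name__ == "__main__":
    for name, value in summarize(RUNS).items():
        print(f"{name}: {value:.5f}")
```

The ratio of totals and the mean of ratios differ because the mean gives short queries such as PigMix_13 the same weight as long ones such as PigMix_10.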
  
  
  == Features Tested ==
@@ -160, +184 @@

   1. union plus distinct
   1. order by
   1. multi-store query (that is, a query where data is scanned once, then split and grouped different ways).
+  1. outer join
+  1. merge join
+  1. multiple distinct aggregates
+  1. accumulative mode
  
  The data is generated so that it has a zipf type distribution for the group by and join keys, as this models most human generated
  data.
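
To illustrate what a Zipf-type key distribution looks like, here is a minimal Python sketch that draws keys with probability proportional to 1/rank^s. It is not the PigMix data generator: the function name `zipf_keys`, the `user_N` key format, and the default exponent of 1.0 are all assumptions made for the example.

```python
import random
from collections import Counter


def zipf_keys(num_keys, num_samples, exponent=1.0, seed=0):
    """Draw num_samples keys where the key at rank r has weight 1 / r**exponent."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    ranks = range(1, num_keys + 1)
    keys = [f"user_{rank}" for rank in ranks]  # hypothetical key names
    weights = [1.0 / rank ** exponent for rank in ranks]
    return rng.choices(keys, weights=weights, k=num_samples)


if __name__ == "__main__":
    counts = Counter(zipf_keys(num_keys=100, num_samples=10_000))
    # A handful of keys dominate; the rest form a long tail.
    for key, count in counts.most_common(5):
        print(key, count)
```

A skew like this means a few group-by or join keys carry a large share of the records, which is the case the benchmark is meant to exercise.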
@@ -207, +235 @@

  between key value pairs and Ctrl-D between keys and values.  Bags in the file are delimited by Ctrl-B between tuples in the bag.
  A special loader, !PigPerformanceLoader, has been written to read this format. 
  
+ PigMix2 includes 4 more data sets, which can be derived from the original data set:
+ {{{
+ A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
+     as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
+ B = order A by user parallel $mappers;
+ store B into 'page_views_sorted' using PigStorage('\u0001');
+ 
+ alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip);
+ a1 = order alpha by name parallel $mappers;
+ store a1 into 'users_sorted' using PigStorage('\u0001');
+ 
+ a = load 'power_users' using PigStorage('\u0001') as (name, phone, address, city, state, zip);
+ b = sample a 0.5;
+ store b into 'power_users_samples' using PigStorage('\u0001');
+ 
+ A = load 'page_views' as (user, action, timespent, query_term, ip_addr, timestamp,
+         estimated_revenue, page_info, page_links);
+ B = foreach A generate user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links,
+ user as user1, action as action1, timespent as timespent1, query_term as query_term1, ip_addr as ip_addr1, timestamp as timestamp1, estimated_revenue as estimated_revenue1, page_info as page_info1, page_links as page_links1,
+ user as user2, action as action2, timespent as timespent2, query_term as query_term2, ip_addr as ip_addr2, timestamp as timestamp2, estimated_revenue as estimated_revenue2, page_info as page_info2, page_links as page_links2;
+ store B into 'widegroupbydata';
+ }}} 
+ 
  == Proposed Scripts ==
  
  === Scalability ===
@@ -415, +466 @@

  This script covers multi-store queries (feature 16).
  {{{
  register pigperf.jar;
- A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
+ A = load '$page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
      as (user, action, timespent, query_term, ip_addr, timestamp,
          estimated_revenue, page_info, page_links);
  B = foreach A generate user, action, (int)timespent as timespent, query_term,
@@ -433, +484 @@

  store gimel into 'queries_per_action';
  }}}
  
+ '''Script L13 (PigMix2 only)'''
+ 
+ This script covers outer join (feature 17).
+ {{{
+ register pigperf.jar;
+ A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
+         as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
+ B = foreach A generate user, estimated_revenue;
+ alpha = load ':INPATH:/pigmix/power_users_samples' using PigStorage('\\u0001') as (name, phone, address, city, state, zip);
+ beta = foreach alpha generate name, phone;
+ C = join B by user left outer, beta by name $parallelfactor;
+ store C into '$out';
+ }}}
+ 
+ '''Script L14 (PigMix2 only)'''
+ 
+ This script covers merge join (feature 18).
+ {{{
+ register pigperf.jar;
+ A = load 'page_views_sorted' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
+     as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
+ B = foreach A generate user, estimated_revenue;
+ alpha = load 'users_sorted' using PigStorage('\\u0001') as (name, phone, address, city, state, zip);
+ beta = foreach alpha generate name;
+ C = join B by user, beta by name using "merge";
+ store C into '$out';
+ }}}
+ 
+ '''Script L15 (PigMix2 only)'''
+ 
+ This script covers multiple distinct aggregates (feature 19).
+ {{{
+ register pigperf.jar;
+ A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
+     as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
+ B = foreach A generate user, action, estimated_revenue, timespent;
+ C = group B by user parallel 40;
+ D = foreach C {
+     beth = distinct B.action;
+     rev = distinct B.estimated_revenue;
+     ts = distinct B.timespent;
+     generate group, COUNT(beth), SUM(rev), (int)AVG(ts);
+ }
+ store D into '$out';
+ }}}
+ 
+ '''Script L16 (PigMix2 only)'''
+ 
+ This script covers accumulative mode (feature 20).
+ {{{
+ register pigperf.jar;
+ A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
+     as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
+ B = foreach A generate user, estimated_revenue;
+ C = group B by user parallel 40;
+ D = foreach C {
+     E = order B by estimated_revenue;
+     F = E.estimated_revenue;
+     generate group, SUM(F);
+ }
+ store D into '$out';
+ }}}
+ 
+ '''Script L17 (PigMix2 only)'''
+ 
+ This script covers wide key group (feature 12).
+ {{{
+ register pigperf.jar;
+ A = load 'widegroupbydata' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
+     as (user, action, timespent, query_term, ip_addr, timestamp,
+         estimated_revenue, page_info, page_links, user_1, action_1, timespent_1, query_term_1, ip_addr_1, timestamp_1,
+         estimated_revenue_1, page_info_1, page_links_1, user_2, action_2, timespent_2, query_term_2, ip_addr_2, timestamp_2,
+         estimated_revenue_2, page_info_2, page_links_2);
+ B = group A by (user, action, timespent, query_term, ip_addr, timestamp,
+         estimated_revenue, user_1, action_1, timespent_1, query_term_1, ip_addr_1, timestamp_1,
+         estimated_revenue_1, user_2, action_2, timespent_2, query_term_2, ip_addr_2, timestamp_2,
+         estimated_revenue_2) parallel 40;
+ C = foreach B generate SUM(A.timespent), SUM(A.timespent_1), SUM(A.timespent_2), AVG(A.estimated_revenue), AVG(A.estimated_revenue_1), AVG(A.estimated_revenue_2);
+ store C into '$out';
+ }}}
+ 
+ 
- Features not yet covered:  5 (bzip data), 8 (sorted join)
+ Features not yet covered:  5 (bzip data)
  
  == Data Generation ==