Posted to user@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2008/10/02 21:57:40 UTC

Call for queries

I propose that Pig develop a standard set of benchmark queries that can 
be run from release to release to measure Pig's (hopefully improving) 
performance over time.  This would be similar in nature to Hadoop's 
GridMix (see 
http://svn.apache.org/viewvc/hadoop/core/tags/release-0.17.1/src/test/gridmix/ 
and http://developer.yahoo.com/blogs/hadoop/).  The set should be 
relatively small (probably under 10 queries), but it should cover the 
range of operations that Pig users perform.

So, if you have queries that you think would be good candidates and that 
you can share (or obfuscate and then share), please do so.  In addition 
to the query, please give some idea of the type of data it runs over.  
In particular, we need to know how much data there is, how many fields 
it has, and the cardinality and distribution of any fields used as a 
group, cogroup, or sort key.
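
One possible way to gather those numbers from a sample of the data is 
sketched below in Python.  It is only an illustration, assuming a 
tab-delimited text sample; the function name profile_key, its 
parameters, and the sample rows are hypothetical and are not part of 
Pig or of any benchmark tooling.

```python
import csv
import io
from collections import Counter


def profile_key(lines, key_index, delimiter="\t", top_n=3):
    """Summarize a delimited data sample for one key column.

    Hypothetical helper: reports the row count, the widest record seen,
    the number of distinct key values (cardinality), and the most
    frequent key values (a rough view of the distribution).
    """
    rows = 0
    max_fields = 0
    counts = Counter()
    for record in csv.reader(lines, delimiter=delimiter):
        if not record:
            continue  # skip blank lines
        rows += 1
        max_fields = max(max_fields, len(record))
        if key_index < len(record):
            counts[record[key_index]] += 1
    return {
        "rows": rows,
        "fields": max_fields,
        "cardinality": len(counts),
        "top_keys": counts.most_common(top_n),
    }


# Made-up sample: three records, keyed on the first column.
sample = io.StringIO("alice\t10\tUS\nbob\t20\tUS\nalice\t30\tDE\n")
summary = profile_key(sample, key_index=0)
print(summary)
```

A heavily skewed key shows up as a few entries in top_keys that 
account for most of the rows, which is the kind of detail that matters 
for a group, cogroup, or sort.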

Thanks.