Posted to dev@pig.apache.org by Ashutosh Chauhan <as...@gmail.com> on 2009/02/14 08:09:38 UTC

Pig Performance Benchmarks

Hi Alan & Others,

I am using the pigmix patch at
https://issues.apache.org/jira/browse/PIG-200 and want to generate
test data and run the pigmix queries on it. As I understand it, the shell
scripts in the patch are intended to generate data for the pigmix queries.
I have been able to adapt the shell scripts, map-reduce jobs, and
pigmix queries to our cluster environment. I faced a few problems because
of hard-coded paths, but resolved most of them. One point is still
unclear, though. I believe there is a one-to-one correspondence between
the test data files generated by the shell scripts and the files loaded by
the pig queries, and I wanted to verify that this is the case. My
understanding of the correspondence is as follows:

generate_data.sh      pigmix
=======================================================================
page_views        ->  pages10m
widerow           ->  widerow1m
power_users       ->  power_users, power_users10m (either could be used?)
users             ->  users, users10m (either could be used?)

Is my understanding correct? Since the generated data is random, I could
not verify this manually by checking the schema inside the files.

Thanks,
Ashutosh

Re: Pig Performance Benchmarks

Posted by Alan Gates <ga...@yahoo-inc.com>.

That's correct. The 10m in the names wasn't really meant to be
hardcoded into the patch, as the idea is that the tables could be
created at different sizes depending on your cluster size. Sorry for
the incomplete state of things; obviously that patch needs some work
before I commit it.

Alan.
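
As an illustration of the point about not hardcoding the size into the
names, the mapping confirmed above could be derived from a size suffix
rather than written out by hand. This is only a minimal sketch: the
function name and its parameters are hypothetical and are not part of the
PIG-200 patch. The only facts taken from the thread are the four
generated names and the file names the queries load.

```python
# Hypothetical helper, not part of the PIG-200 patch: it only rebuilds the
# name mapping discussed in this thread from configurable size suffixes.
def pigmix_table_names(size="10m", widerow_size="1m"):
    """Map each generated data set to the file name(s) the queries load."""
    return {
        "page_views": [f"pages{size}"],
        "widerow": [f"widerow{widerow_size}"],
        # For these two, either the plain or the suffixed name is in use.
        "power_users": ["power_users", f"power_users{size}"],
        "users": ["users", f"users{size}"],
    }


if __name__ == "__main__":
    # Print the mapping for the default sizes used in the thread.
    for generated, loaded in pigmix_table_names().items():
        print(f"{generated:<12} -> {', '.join(loaded)}")
```

Calling the function with a different suffix (for example size="100m")
would give the names for a larger data set, assuming the scripts and
queries were parameterized in the same way.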
