Posted to user@pig.apache.org by Lai Will <la...@student.ethz.ch> on 2011/04/20 21:57:39 UTC

Benchmark Hadoop and Pig UDFs

Hi there,

I'm planning to do some performance measurements of my Hadoop Pig code in order to see how it scales.
Does anyone have any suggestions on how to do that?

I thought of measuring the time needed for completion on a fixed cluster size while increasing the input data, and then fixing the input data while adding cluster nodes.
Does anyone have experience doing that? I thought of writing a script that starts/stops a timer and executes the Pig command. Maybe there's a better way?
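
Concretely, I had something like this minimal Python sketch in mind (untested; it assumes the pig executable is on the PATH, and myjob.pig is a placeholder for my actual script):

import subprocess
import time

PIG_SCRIPT = "myjob.pig"  # placeholder for the real script

def run_once():
    """Run the Pig script once and return wall-clock seconds."""
    start = time.time()
    subprocess.check_call(["pig", PIG_SCRIPT])
    return time.time() - start

if __name__ == "__main__":
    # A few repetitions to smooth out cluster noise.
    timings = [run_once() for _ in range(3)]
    for t in timings:
        print("%.1f s" % t)
    print("mean: %.1f s" % (sum(timings) / len(timings)))

Wall-clock time measured at the client includes job-submission overhead, but that is what a user would see anyway.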

Best,
Will

Re: Benchmark Hadoop and Pig UDFs

Posted by Guy Bayes <fa...@gmail.com>.
One thing I would say is: don't benchmark on EC2; do it on physical
hardware.

There is a test harness infrastructure for generic benchmarking at
http://bbltest.sourceforge.net/ that might be somewhat useful.

Guy

On Wed, Apr 20, 2011 at 2:19 PM, Lai Will <la...@student.ethz.ch> wrote:

> My goal is to show that Hadoop can be used for a certain use case.
> I don't need to compare the different usage forms of Hadoop.
>
> So your second hint is pretty much what I thought of doing.
>
> Do you or does anyone else already have experience in doing that?
> What technologies did you use to achieve that? A bash script?
> Python?
> How would you set up the benchmark?
>
> Best,
> Will
>
> -----Original Message-----
> From: Mridul Muralidharan [mailto:mridulm@yahoo-inc.com]
> Sent: Wednesday, 20 April 2011 23:13
> To: user@pig.apache.org
> Cc: Lai Will
> Subject: Re: Benchmark Hadoop and Pig UDFs
>
>
> Not sure what the scope of the experiment is, but some useful comparisons
> could be against:
> a) a job using only the mapred API.
> b) Hadoop streaming.
> c) Pig streaming.
>
> It also depends on the actual script/job being run - whether it uses
> combiners or multiple outputs, the depth of the pipeline, how many jobs
> you end up running for it, etc.
>
>
>
> If you are interested only in testing how Pig scales, then interesting
> metrics could be:
> a) size of input.
> b) with/without compression.
> c) number of mappers.
> d) number of reducers.
> e) output size (depending on what you are running, I guess).
>
>
> Regards,
> Mridul
>
>
> On Thursday 21 April 2011 01:27 AM, Lai Will wrote:
> > Hi there,
> >
> > I'm planning to do some performance measurements of my Hadoop Pig code
> > in order to see how it scales.
> > Does anyone have any suggestions on how to do that?
> >
> > I thought of measuring the time needed for completion on a fixed cluster
> > size while increasing the input data, and then fixing the input data while
> > adding cluster nodes. Does anyone have experience doing that? I thought of
> > writing a script that starts/stops a timer and executes the Pig command.
> > Maybe there's a better way?
> >
> > Best,
> > Will
>
>

RE: Benchmark Hadoop and Pig UDFs

Posted by Lai Will <la...@student.ethz.ch>.
My goal is to show that Hadoop can be used for a certain use case.
I don't need to compare the different usage forms of Hadoop.

So your second hint is pretty much what I thought of doing.

Do you or does anyone else already have experience in doing that?
What technologies did you use to achieve that? A bash script? Python?
How would you set up the benchmark?
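
Concretely, I was imagining a small Python driver along these lines (untested sketch; the HDFS paths are made up, and it assumes one pre-generated input directory per size plus Pig parameter substitution via -param):

import subprocess
import time

# Hypothetical pre-generated inputs of increasing size, one HDFS dir each.
INPUTS = ["/data/in_1gb", "/data/in_10gb", "/data/in_100gb"]

with open("timings.csv", "w") as out:
    out.write("input,seconds\n")
    for path in INPUTS:
        start = time.time()
        # The Pig script reads its input location from the $input parameter.
        subprocess.check_call(["pig", "-param", "input=" + path, "myjob.pig"])
        out.write("%s,%.1f\n" % (path, time.time() - start))

The cluster-size dimension is the part a script can't easily drive: adding nodes and rebalancing HDFS between runs is manual work, so I'd treat each cluster size as a separate batch of runs over the same inputs.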

Best,
Will

-----Original Message-----
From: Mridul Muralidharan [mailto:mridulm@yahoo-inc.com] 
Sent: Wednesday, 20 April 2011 23:13
To: user@pig.apache.org
Cc: Lai Will
Subject: Re: Benchmark Hadoop and Pig UDFs


Not sure what the scope of the experiment is, but some useful comparisons could be against:
a) a job using only the mapred API.
b) Hadoop streaming.
c) Pig streaming.

It also depends on the actual script/job being run - whether it uses combiners or multiple outputs, the depth of the pipeline, how many jobs you end up running for it, etc.



If you are interested only in testing how Pig scales, then interesting metrics could be:
a) size of input.
b) with/without compression.
c) number of mappers.
d) number of reducers.
e) output size (depending on what you are running, I guess).


Regards,
Mridul


On Thursday 21 April 2011 01:27 AM, Lai Will wrote:
> Hi there,
>
> I'm planning to do some performance measurements of my Hadoop Pig code in order to see how it scales.
> Does anyone have any suggestions on how to do that?
>
> I thought of measuring the time needed for completion on a fixed cluster size while increasing the input data, and then fixing the input data while adding cluster nodes.
> Does anyone have experience doing that? I thought of writing a script that starts/stops a timer and executes the Pig command. Maybe there's a better way?
>
> Best,
> Will


Re: Benchmark Hadoop and Pig UDFs

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Not sure what the scope of the experiment is, but some useful
comparisons could be against:
a) a job using only the mapred API.
b) Hadoop streaming.
c) Pig streaming.

It also depends on the actual script/job being run - whether it uses
combiners or multiple outputs, the depth of the pipeline, how many jobs
you end up running for it, etc.



If you are interested only in testing how Pig scales, then interesting
metrics could be (see the sketch after this list for varying them):
a) size of input.
b) with/without compression.
c) number of mappers.
d) number of reducers.
e) output size (depending on what you are running, I guess).
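
Most of these can be varied per run by passing properties on the pig command line. An untested Python sketch (the -D arguments must come before the other Pig arguments; the property names below are from the 0.20-era mapred API and may differ on your Hadoop version; myjob.pig is a placeholder):

import subprocess

# Hypothetical sweep over reducer count and output compression.
for reducers in (4, 16, 64):
    for compress in ("true", "false"):
        subprocess.check_call([
            "pig",
            "-Dmapred.reduce.tasks=%d" % reducers,
            "-Dmapred.output.compress=%s" % compress,
            "myjob.pig",
        ])

Note that inside a Pig script, reducer parallelism is normally set with PARALLEL clauses or "SET default_parallel N" rather than mapred.reduce.tasks, so adjust to whichever mechanism your script uses.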


Regards,
Mridul


On Thursday 21 April 2011 01:27 AM, Lai Will wrote:
> Hi there,
>
> I'm planning to do some performance measurements of my Hadoop Pig code in order to see how it scales.
> Does anyone have any suggestions on how to do that?
>
> I thought of measuring the time needed for completion on a fixed cluster size while increasing the input data, and then fixing the input data while adding cluster nodes.
> Does anyone have experience doing that? I thought of writing a script that starts/stops a timer and executes the Pig command. Maybe there's a better way?
>
> Best,
> Will