Posted to common-user@hadoop.apache.org by Shrinivas Joshi <js...@gmail.com> on 2011/02/18 22:32:07 UTC

benchmark choices

Which workloads are used for serious benchmarking of Hadoop clusters? Do you
care about any of the following workloads :
TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench,
sample apps shipped with Hadoop distro like PiEstimator, dbcount etc.

Thanks,
-Shrinivas

Re: benchmark choices

Posted by Ted Dunning <td...@maprtech.com>.

MalStone looks like a very narrow benchmark.

Terasort is also a very narrow and somewhat idiosyncratic benchmark, but it
has the characteristic that lots of people use it.

You should add PigMix to your list.  There are Java versions of the problems in
PigMix that make a pretty good set of benchmarks independent of Pig itself.

On Fri, Feb 18, 2011 at 1:32 PM, Shrinivas Joshi <js...@gmail.com> wrote:

> Which workloads are used for serious benchmarking of Hadoop clusters? Do
> you
> care about any of the following workloads :
> TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench,
> sample apps shipped with Hadoop distro like PiEstimator, dbcount etc.
>
> Thanks,
> -Shrinivas
>

Re: benchmark choices

Posted by Konstantin Boudnik <co...@apache.org>.

Adding Roman Shaposhnik, who is "tasked" with benchmarking at Cloudera, to the list.

On Mon, Feb 21, 2011 at 12:39, Shrinivas Joshi <js...@gmail.com> wrote:
> I wonder what companies like Amazon, Cloudera, RackSpace, Facebook, Yahoo
> etc. look at for the purpose of benchmarking. I guess GridMix v3 might be of
> more interest to Yahoo.
>
> I would appreciate it if someone could comment more on this.
>
> Thanks,
> -Shrinivas
>
> On Fri, Feb 18, 2011 at 4:50 PM, Konstantin Boudnik <co...@apache.org> wrote:
>>
>> On Fri, Feb 18, 2011 at 14:35, Ted Dunning <td...@maprtech.com> wrote:
>> > I just read the malstone report.  They report times for a Java version that
>> > is many (5x) times slower than for a streaming implementation.  That single
>> > fact indicates that the Java code is so appallingly bad that this is a very
>> > bad benchmark.
>>
>> Slow Java code? That's funny ;) Running with Hotspot on by any chance?
>>
>> > On Fri, Feb 18, 2011 at 2:27 PM, Jim Falgout <ji...@pervasive.com> wrote:
>> >
>> >> We use MalStone and TeraSort. For Hive, you can use TPC-H, at least the
>> >> data and the queries, if not the query generator. There is a Jira issue in
>> >> Hive that discusses the TPC-H "benchmark" if you're interested. Sorry, I
>> >> don't remember the issue number offhand.
>> >>
>> >> -----Original Message-----
>> >> From: Shrinivas Joshi [mailto:jshrinivas@gmail.com]
>> >> Sent: Friday, February 18, 2011 3:32 PM
>> >> To: common-user@hadoop.apache.org
>> >> Subject: benchmark choices
>> >>
>> >> Which workloads are used for serious benchmarking of Hadoop clusters? Do
>> >> you care about any of the following workloads :
>> >> TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench,
>> >> sample apps shipped with Hadoop distro like PiEstimator, dbcount etc.
>> >>
>> >> Thanks,
>> >> -Shrinivas
>> >>
>> >>
>> >
>
>

Re: benchmark choices

Posted by Shrinivas Joshi <js...@gmail.com>.

I wonder what companies like Amazon, Cloudera, RackSpace, Facebook, Yahoo
etc. look at for the purpose of benchmarking. I guess GridMix v3 might be of
more interest to Yahoo.

I would appreciate it if someone could comment more on this.

Thanks,
-Shrinivas

On Fri, Feb 18, 2011 at 4:50 PM, Konstantin Boudnik <co...@apache.org> wrote:

> On Fri, Feb 18, 2011 at 14:35, Ted Dunning <td...@maprtech.com> wrote:
> > I just read the malstone report.  They report times for a Java version that
> > is many (5x) times slower than for a streaming implementation.  That single
> > fact indicates that the Java code is so appallingly bad that this is a very
> > bad benchmark.
>
> Slow Java code? That's funny ;) Running with Hotspot on by any chance?
>
> > On Fri, Feb 18, 2011 at 2:27 PM, Jim Falgout <jim.falgout@pervasive.com> wrote:
> >
> >> We use MalStone and TeraSort. For Hive, you can use TPC-H, at least the
> >> data and the queries, if not the query generator. There is a Jira issue in
> >> Hive that discusses the TPC-H "benchmark" if you're interested. Sorry, I
> >> don't remember the issue number offhand.
> >>
> >> -----Original Message-----
> >> From: Shrinivas Joshi [mailto:jshrinivas@gmail.com]
> >> Sent: Friday, February 18, 2011 3:32 PM
> >> To: common-user@hadoop.apache.org
> >> Subject: benchmark choices
> >>
> >> Which workloads are used for serious benchmarking of Hadoop clusters? Do
> >> you care about any of the following workloads :
> >> TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench,
> >> sample apps shipped with Hadoop distro like PiEstimator, dbcount etc.
> >>
> >> Thanks,
> >> -Shrinivas
> >>
> >>
> >
>

Re: benchmark choices

Posted by Konstantin Boudnik <co...@apache.org>.

On Fri, Feb 18, 2011 at 14:35, Ted Dunning <td...@maprtech.com> wrote:
> I just read the malstone report.  They report times for a Java version that
> is many (5x) times slower than for a streaming implementation.  That single
> fact indicates that the Java code is so appallingly bad that this is a very
> bad benchmark.

Slow Java code? That's funny ;) Running with Hotspot on by any chance?

> On Fri, Feb 18, 2011 at 2:27 PM, Jim Falgout <ji...@pervasive.com> wrote:
>
>> We use MalStone and TeraSort. For Hive, you can use TPC-H, at least the
>> data and the queries, if not the query generator. There is a Jira issue in
>> Hive that discusses the TPC-H "benchmark" if you're interested. Sorry, I
>> don't remember the issue number offhand.
>>
>> -----Original Message-----
>> From: Shrinivas Joshi [mailto:jshrinivas@gmail.com]
>> Sent: Friday, February 18, 2011 3:32 PM
>> To: common-user@hadoop.apache.org
>> Subject: benchmark choices
>>
>> Which workloads are used for serious benchmarking of Hadoop clusters? Do
>> you care about any of the following workloads :
>> TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench,
>> sample apps shipped with Hadoop distro like PiEstimator, dbcount etc.
>>
>> Thanks,
>> -Shrinivas
>>
>>
>

Re: benchmark choices

Posted by Ted Dunning <td...@maprtech.com>.

I just read the MalStone report.  They report times for a Java version that
is many times (5x) slower than for a streaming implementation.  That single
fact indicates that the Java code is so appallingly bad that this is a very
bad benchmark.

On Fri, Feb 18, 2011 at 2:27 PM, Jim Falgout <ji...@pervasive.com> wrote:

> We use MalStone and TeraSort. For Hive, you can use TPC-H, at least the
> data and the queries, if not the query generator. There is a Jira issue in
> Hive that discusses the TPC-H "benchmark" if you're interested. Sorry, I
> don't remember the issue number offhand.
>
> -----Original Message-----
> From: Shrinivas Joshi [mailto:jshrinivas@gmail.com]
> Sent: Friday, February 18, 2011 3:32 PM
> To: common-user@hadoop.apache.org
> Subject: benchmark choices
>
> Which workloads are used for serious benchmarking of Hadoop clusters? Do
> you care about any of the following workloads :
> TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench,
> sample apps shipped with Hadoop distro like PiEstimator, dbcount etc.
>
> Thanks,
> -Shrinivas
>
>

Re: benchmark choices

Posted by Shrinivas Joshi <js...@gmail.com>.

Thanks, Jim. The MRBench described in this paper
http://dcslab.snu.ac.kr/~khjeon/papers/2008/icpads_mrbench.pdf looks like a
map/reduce port of the TPC-H workload. BTW, the MRBench in that paper and the
one in mapred/src/test/mapred/org/apache/hadoop/mapred/MRBench.java
look different to me. Is that a fair statement?

-Shrinivas

On Fri, Feb 18, 2011 at 4:27 PM, Jim Falgout <ji...@pervasive.com> wrote:

> We use MalStone and TeraSort. For Hive, you can use TPC-H, at least the
> data and the queries, if not the query generator. There is a Jira issue in
> Hive that discusses the TPC-H "benchmark" if you're interested. Sorry, I
> don't remember the issue number offhand.
>
> -----Original Message-----
> From: Shrinivas Joshi [mailto:jshrinivas@gmail.com]
> Sent: Friday, February 18, 2011 3:32 PM
> To: common-user@hadoop.apache.org
> Subject: benchmark choices
>
> Which workloads are used for serious benchmarking of Hadoop clusters? Do
> you care about any of the following workloads :
> TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench,
> sample apps shipped with Hadoop distro like PiEstimator, dbcount etc.
>
> Thanks,
> -Shrinivas
>
>

RE: benchmark choices

Posted by Jim Falgout <ji...@pervasive.com>.

We use MalStone and TeraSort. For Hive, you can use TPC-H, at least the data and the queries, if not the query generator. There is a Jira issue in Hive that discusses the TPC-H "benchmark" if you're interested. Sorry, I don't remember the issue number offhand.

-----Original Message-----
From: Shrinivas Joshi [mailto:jshrinivas@gmail.com] 
Sent: Friday, February 18, 2011 3:32 PM
To: common-user@hadoop.apache.org
Subject: benchmark choices

Which workloads are used for serious benchmarking of Hadoop clusters? Do you care about any of the following workloads :
TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench, sample apps shipped with Hadoop distro like PiEstimator, dbcount etc.

Thanks,
-Shrinivas