Posted to common-user@hadoop.apache.org by Joel Welling <we...@psc.edu> on 2008/09/18 00:27:31 UTC

gridmix on a small cluster?

Hi folks;
  I'd like to try the gridmix benchmark on my small cluster (3 nodes at
8 cores each, Lustre with IB interconnect).  The documentation for
gridmix suggests that it will take 4 hours on a 500 node cluster, which
suggests it would take me something like a week to run.  Is there a way
to scale the problem size back?  I don't mind the file size too much,
but the running time would be excessive if things scale linearly with
the number of nodes.

Thanks,
-Joel


Re: gridmix on a small cluster?

Posted by Chris Douglas <ch...@yahoo-inc.com>.
Yes. If you look at the README, gridmix-env, and the generateData script, you should be able to alter the job mix to match your requirements. In particular, look closely at the number of small, medium, and large jobs for each run. For a three-node cluster, you might want to run only the small jobs (and possibly the medium jobs).

Note that you don't have to generate the entropy dataset if you don't plan on running any large jobs (what it tests isn't interesting on three nodes anyway). Also note that the "real" dataset is 1000 times larger than what generateData produces by default; a smaller dataset may let you keep the total number of jobs up, though you should also be wary of the load on the submitting node (see submissionScripts/sleep_if_too_busy).

Keep in mind that each node may also store (possibly uncompressed) copies of the datasets as intermediate map outputs, so budgeting for local disk space will also be important while gridmix runs, particularly for the "medium" jobs.

Good luck. -C
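For reference, the knobs described above live in gridmix-env and feed the generateData script. A minimal sketch of scaling things down for a small cluster might look like the following; the variable names here are illustrative assumptions based on typical gridmix-env layouts, so check the gridmix-env shipped with your Hadoop release for the actual names and defaults:

```shell
# Sketch: trim the gridmix job mix for a 3-node cluster.
# NOTE: these variable names are assumptions -- verify them against the
# gridmix-env file in your own gridmix distribution before running.

# Run only the small jobs; skip medium and large entirely.
export NUM_OF_SMALL_JOBS_PER_CLASS=5
export NUM_OF_MEDIUM_JOBS_PER_CLASS=0
export NUM_OF_LARGE_JOBS_PER_CLASS=0

# Shrink the generated input data well below the "real" dataset size
# (which is roughly 1000x the generateData default, per the note above).
export COMPRESSED_DATA_BYTES=$((2 * 1024 * 1024 * 1024))   # ~2 GB
export UNCOMPRESSED_DATA_BYTES=$((1 * 1024 * 1024 * 1024)) # ~1 GB
```

After editing, re-run generateData and the submission scripts as usual; with the large jobs zeroed out, the entropy dataset can be skipped entirely.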
