You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-commits@hadoop.apache.org by cd...@apache.org on 2009/10/08 18:41:57 UTC
svn commit: r823229 - in /hadoop/mapreduce/branches/branch-0.21: CHANGES.txt src/contrib/gridmix/README src/docs/src/documentation/content/xdocs/gridmix.xml src/docs/src/documentation/content/xdocs/site.xml

Author: cdouglas
Date: Thu Oct  8 16:41:56 2009
New Revision: 823229

URL: http://svn.apache.org/viewvc?rev=823229&view=rev
Log:
MAPREDUCE-1063. Document gridmix benchmark.

Added:
    hadoop/mapreduce/branches/branch-0.21/src/contrib/gridmix/README
    hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/gridmix.xml
Modified:
    hadoop/mapreduce/branches/branch-0.21/CHANGES.txt
    hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/site.xml

Modified: hadoop/mapreduce/branches/branch-0.21/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/branch-0.21/CHANGES.txt?rev=823229&r1=823228&r2=823229&view=diff
==============================================================================
--- hadoop/mapreduce/branches/branch-0.21/CHANGES.txt (original)
+++ hadoop/mapreduce/branches/branch-0.21/CHANGES.txt Thu Oct  8 16:41:56 2009
@@ -414,6 +414,8 @@
     MAPREDUCE-639. Change Terasort example to reflect the 2009 updates. 
     (omalley)
 
+    MAPREDUCE-1063. Document gridmix benchmark. (cdouglas)
+
   BUG FIXES
 
     MAPREDUCE-878. Rename fair scheduler design doc to 

Added: hadoop/mapreduce/branches/branch-0.21/src/contrib/gridmix/README
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/branch-0.21/src/contrib/gridmix/README?rev=823229&view=auto
==============================================================================
--- hadoop/mapreduce/branches/branch-0.21/src/contrib/gridmix/README (added)
+++ hadoop/mapreduce/branches/branch-0.21/src/contrib/gridmix/README Thu Oct  8 16:41:56 2009
@@ -0,0 +1,22 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+This project implements the third version of Gridmix, a benchmark for live
+clusters. Given a description of jobs (a "trace") annotated with information
+about I/O, memory, etc. a synthetic mix of jobs will be generated and submitted
+to the cluster.
+
+Documentation of usage and configuration properties in forrest is available in
+src/docs/src/documentation/content/xdocs/gridmix.xml

Added: hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/gridmix.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/gridmix.xml?rev=823229&view=auto
==============================================================================
--- hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/gridmix.xml (added)
+++ hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/gridmix.xml Thu Oct  8 16:41:56 2009
@@ -0,0 +1,164 @@
+<?xml version="1.0"?>
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
+
+<document>
+
+<header>
+  <title>Gridmix</title>
+</header>
+
+<body>
+
+  <section>
+  <title>Overview</title>
+
+  <p>Gridmix is a benchmark for live clusters. It submits a mix of synthetic
+  jobs, modeling a profile mined from production loads.</p>
+
+  <p>There exist three versions of the Gridmix tool. This document discusses
+  the third (checked into contrib), distinct from the two checked into the
+  benchmarks subdirectory. While the first two versions of the tool included
+  stripped-down versions of common jobs, both were principally saturation
+  tools for stressing the framework at scale. In support of a broader range of
+  deployments and finer-tuned job mixes, this version of the tool will attempt
+  to model the resource profiles of production jobs to identify bottlenecks,
+  guide development, and serve as a replacement for the existing gridmix
+  benchmarks.</p>
+
+  </section>
+
+  <section id="usage">
+
+  <title>Usage</title>
+
+  <p>To run Gridmix, one requires a job trace describing the job mix for a
+  given cluster. Such traces are typically genenerated by Rumen (see related
+  documentation). Gridmix also requires input data from which the synthetic
+  jobs will draw bytes. The input data need not be in any particular format,
+  as the synthetic jobs are currently binary readers. If one is running on a
+  new cluster, an optional step generating input data may precede the run.</p>
+
+  <p>Basic command line usage:</p>
+<source>
+
+bin/mapred org.apache.hadoop.mapred.gridmix.Gridmix [-generate &lt;MiB&gt;] &lt;iopath&gt; &lt;trace&gt;
+</source>
+
+  <p>The <code>-generate</code> parameter accepts standard units, e.g.
+  <code>100g</code> will generate 100 * 2<sup>30</sup> bytes. The
+  &lt;iopath&gt; parameter is the destination directory for generated and/or
+  the directory from which input data will be read. The &lt;trace&gt;
+  parameter is a path to a job trace. The following configuration parameters
+  are also accepted in the standard idiom, before other Gridmix
+  parameters.</p>
+
+  <section>
+  <title>Configuration parameters</title>
+  <p></p>
+  <table>
+    <tr><th> Parameter </th><th> Description </th><th> Notes </th></tr>
+    <tr><td><code>gridmix.output.directory</code></td>
+        <td>The directory into which output will be written. If specified, the
+        <code>iopath</code> will be relative to this parameter.</td>
+        <td>The submitting user must have read/write access to this
+        directory. The user should also be mindful of any quota issues that
+        may arise during a run.</td></tr>
+    <tr><td><code>gridmix.client.submit.threads</code></td>
+        <td>The number of threads submitting jobs to the cluster. This also
+        controls how many splits will be loaded into memory at a given time,
+        pending the submit time in the trace.</td>
+        <td>Splits are pregenerated to hit submission deadlines, so
+        particularly dense traces may want more submitting threads. However,
+        storing splits in memory is reasonably expensive, so one should raise
+        this cautiously.</td></tr>
+    <tr><td><code>gridmix.client.pending.queue.depth</code></td>
+        <td>The depth of the queue of job descriptions awaiting split
+        generation.</td>
+        <td>The jobs read from the trace occupy a queue of this depth before
+        being processed by the submission threads. It is unusual to configure
+        this.</td></tr>
+    <tr><td><code>gridmix.min.key.length</code></td>
+        <td>The key size for jobs submitted to the cluster.</td>
+        <td>While this is clearly a job-specific, even task-specific property,
+        no data on key length is currently available. Since the intermediate
+        data are random, memcomparable data, not even the sort is likely
+        affected. It exists as a tunable as no default value is appropriate,
+        but future versions will likely replace it with trace data.</td></tr>
+  </table>
+
+  </section>
+</section>
+
+<section id="assumptions">
+
+  <title>Simplifying Assumptions</title>
+
+  <p>Gridmix will be developed in stages, incorporating feedback and patches
+  from the community. Currently, its intent is to evaluate Map/Reduce and HDFS
+  performance and not the layers on top of them (i.e. the extensive lib and
+  subproject space). Given these two limitations, the following
+  characteristics of job load are not currently captured in job traces and
+  cannot be accurately reproduced in Gridmix.</p>
+
+  <table>
+  <tr><th>Property</th><th>Notes</th></tr>
+  <tr><td>CPU usage</td><td>We have no data for per-task CPU usage, so we
+  cannot attempt even an approximation. Gridmix tasks are never CPU bound
+  independent of I/O, though this surely happens in practice.</td></tr>
+  <tr><td>Filesystem properties</td><td>No attempt is made to match block
+  sizes, namespace hierarchies, or any property of input, intermediate, or
+  output data other than the bytes/records consumed and emitted from a given
+  task. This implies that some of the most heavily used parts of the system-
+  the compression libraries, text processing, streaming, etc.- cannot be
+  meaningfully tested with the current implementation.</td></tr>
+  <tr><td>I/O rates</td><td>The rate at which records are consumed/emitted is
+  assumed to be limited only by the speed of the reader/writer and constant
+  throughout the task.</td></tr>
+  <tr><td>Memory profile</td><td>No data on tasks' memory usage over time is
+  available, though the max heap size is retained.</td></tr>
+  <tr><td>Skew</td><td>The records consumed and emitted to/from a given task
+  are assumed to follow observed averages, i.e. records will be more regular
+  than may be seen in the wild. Each map also generates a proportional
+  percentage of data for each reduce, so a job with unbalanced input will be
+  flattened.</td></tr>
+  <tr><td>Job failure</td><td>User code is assumed to be correct.</td></tr>
+  <tr><td>Job independence</td><td>The output or outcome of one job does not
+  affect when or whether a subsequent job will run.</td></tr>
+  </table>
+
+</section>
+
+<section>
+
+  <title>Appendix</title>
+
+  <p>Issues tracking the implementations of <a
+  href="https://issues.apache.org/jira/browse/HADOOP-2369">gridmix1</a>, <a
+  href="https://issues.apache.org/jira/browse/HADOOP-3770">gridmix2</a>, and
+  <a href="https://issues.apache.org/jira/browse/MAPREDUCE-776">gridmix3</a>.
+  Other issues tracking the development of Gridmix can be found by searching
+  the Map/Reduce <a
+  href="https://issues.apache.org/jira/browse/MAPREDUCE">JIRA</a></p>
+
+</section>
+
+</body>
+
+</document>

Modified: hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/site.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/site.xml?rev=823229&r1=823228&r2=823229&view=diff
==============================================================================
--- hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/site.xml (original)
+++ hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/site.xml Thu Oct  8 16:41:56 2009
@@ -43,6 +43,7 @@
 		<distcp    					label="DistCp"       href="distcp.html" />
 		<vaidya    					label="Vaidya" 		href="vaidya.html"/>
 		<archives  				label="Hadoop Archives"     href="hadoop_archives.html"/>
+		<gridmix  				label="Gridmix"     href="gridmix.html"/>
    </docs>
    
     <docs label="Schedulers">