Posted to commits@mahout.apache.org by sr...@apache.org on 2012/02/11 11:22:31 UTC

svn commit: r1243022 [21/38] - in /mahout/site/new_website: ./ MAHOUT/ MAHOUT/2010/ MAHOUT/2010/09/ MAHOUT/2010/09/14/ MAHOUT/2011/ MAHOUT/2011/10/ MAHOUT/2011/10/21/ MAHOUT/books-tutorials-and-talks.data/ MAHOUT/books-tutorials-talks.data/ MAHOUT/book...

Added: mahout/site/new_website/MAHOUT/parallel-frequent-pattern-mining.html
URL: http://svn.apache.org/viewvc/mahout/site/new_website/MAHOUT/parallel-frequent-pattern-mining.html?rev=1243022&view=auto
==============================================================================
--- mahout/site/new_website/MAHOUT/parallel-frequent-pattern-mining.html (added)
+++ mahout/site/new_website/MAHOUT/parallel-frequent-pattern-mining.html Sat Feb 11 10:22:15 2012
@@ -0,0 +1,268 @@
+
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<HTML>
+  <HEAD>
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/space.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/master.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/wiki-content.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/abs.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/menu.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/menu-ie.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/tables.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/panels.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/master-ie.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/renderer-macros.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/content-types.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/login.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/information-macros.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/layout-macros.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/default-theme.css">
+    <LINK type="text/css" rel="stylesheet" href="resources/space.css">
+    <STYLE type="text/css">
+      .footer {
+        background-image:      url('https://cwiki.apache.org/confluence/images/border/border_bottom.gif');
+        background-repeat:     repeat-x;
+        background-position:   left top;
+        padding-top:           4px;
+        color:                 #666;
+      }
+    </STYLE>
+    <SCRIPT type="text/javascript" language="javascript">
+      var hide = null;
+      var show = null;
+      var children = null;
+
+      function init() {
+        /* Search form initialization */
+        var form = document.forms['search'];
+        if (form != null) {
+          form.elements['domains'].value = location.hostname;
+          form.elements['sitesearch'].value = location.hostname;
+        }
+
+        /* Children initialization */
+        hide = document.getElementById('hide');
+        show = document.getElementById('show');
+        children = document.all != null ?
+                   document.all['children'] :
+                   document.getElementById('children');
+        if (children != null) {
+          children.style.display = 'none';
+          show.style.display = 'inline';
+          hide.style.display = 'none';
+        }
+      }
+
+      function showChildren() {
+        children.style.display = 'block';
+        show.style.display = 'none';
+        hide.style.display = 'inline';
+      }
+
+      function hideChildren() {
+        children.style.display = 'none';
+        show.style.display = 'inline';
+        hide.style.display = 'none';
+      }
+    </SCRIPT>
+    <TITLE>Parallel Frequent Pattern Mining</TITLE>
+  <META http-equiv="Content-Type" content="text/html;charset=UTF-8"></HEAD>
+  <BODY onload="init()">
+    <TABLE border="0" cellpadding="2" cellspacing="0" width="100%">
+      <TR class="topBar">
+        <TD align="left" valign="middle" class="topBarDiv" align="left" nowrap="">
+          &nbsp;<A href="mahout-wiki.html" title="Apache Mahout">Apache Mahout</A>&nbsp;&gt;&nbsp;<A href="mahout-wiki.html" title="Mahout Wiki">Mahout Wiki</A>&nbsp;&gt;&nbsp;<A href="algorithms.html" title="Algorithms">Algorithms</A>&nbsp;&gt;&nbsp;<A href="" title="Parallel Frequent Pattern Mining">Parallel Frequent Pattern Mining</A>
+        </TD>
+        <TD align="right" valign="middle" nowrap="">
+          <FORM name="search" action="http://www.google.com/search" method="get">
+            <INPUT type="hidden" name="ie" value="UTF-8">
+            <INPUT type="hidden" name="oe" value="UTF-8">
+            <INPUT type="hidden" name="domains" value="">
+            <INPUT type="hidden" name="sitesearch" value="">
+            <INPUT type="text" name="q" maxlength="255" value="">        
+            <INPUT type="submit" name="btnG" value="Google Search">
+          </FORM>
+        </TD>
+      </TR> 
+    </TABLE>
+
+    <DIV id="PageContent">
+      <DIV class="pageheader" style="padding: 6px 0px 0px 0px;">
+        <!-- We'll enable this once we figure out how to access (and save) the logo resource -->
+        <!--img src="/wiki/images/confluence_logo.gif" style="float: left; margin: 4px 4px 4px 10px;" border="0"-->
+        <DIV style="margin: 0px 10px 0px 10px" class="smalltext">Apache Mahout</DIV>
+        <DIV style="margin: 0px 10px 8px 10px" class="pagetitle">Parallel Frequent Pattern Mining</DIV>
+
+        <DIV class="greynavbar" align="right" style="padding: 2px 10px; margin: 0px;">
+          <A href="https://cwiki.apache.org/confluence/pages/editpage.action?pageId=13762823">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/notep_16.gif" height="16" width="16" border="0" align="absmiddle" title="Edit Page"></A>
+            <A href="https://cwiki.apache.org/confluence/pages/editpage.action?pageId=13762823">Edit Page</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/browse_space.gif" height="16" width="16" border="0" align="absmiddle" title="Browse Space"></A>
+            <A href="https://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">Browse Space</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=13762823">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/add_page_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add Page"></A>
+          <A href="https://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=13762823">Add Page</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=13762823">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/add_blogentry_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add News"></A>
+          <A href="https://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=13762823">Add News</A>
+        </DIV>
+      </DIV>
+
+      <DIV class="pagecontent">
+        <DIV class="wiki-content">
+          <P>Mahout has a Top K Parallel FPGrowth implementation. It is based on the paper <A href="http://infolab.stanford.edu/~echang/recsys08-69.pdf" class="external-link" rel="nofollow">http://infolab.stanford.edu/~echang/recsys08-69.pdf</A>, with some optimisations in mining the data.</P>
+
+<P>Given a huge transaction list, the algorithm finds all unique features (sets of field values) and eliminates those features whose frequency in the whole dataset is less than minSupport. For the N features that remain, we find the top K closed patterns for each of them, generating up to NxK patterns in total. The FPGrowth algorithm is a generic implementation, so any Object type can be used to denote a feature. The current implementation requires you to use a String as the object type. You may implement a version for any other object type by creating Iterators, Convertors and a TopKPatternWritable for that particular type. For more information, please refer to the package org.apache.mahout.fpm.pfpgrowth.convertors.string.</P>
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+// e.g.:
+FPGrowth&lt;<SPAN class="code-object">String</SPAN>&gt; fp = <SPAN class="code-keyword">new</SPAN> FPGrowth&lt;<SPAN class="code-object">String</SPAN>&gt;();
+Set&lt;<SPAN class="code-object">String</SPAN>&gt; features = <SPAN class="code-keyword">new</SPAN> HashSet&lt;<SPAN class="code-object">String</SPAN>&gt;();
+fp.generateTopKStringFrequentPatterns(
+    <SPAN class="code-keyword">new</SPAN> StringRecordIterator(<SPAN class="code-keyword">new</SPAN> FileLineIterable(<SPAN class="code-keyword">new</SPAN> File(input), encoding, <SPAN class="code-keyword">false</SPAN>), pattern),
+    fp.generateFList(
+        <SPAN class="code-keyword">new</SPAN> StringRecordIterator(<SPAN class="code-keyword">new</SPAN> FileLineIterable(<SPAN class="code-keyword">new</SPAN> File(input), encoding, <SPAN class="code-keyword">false</SPAN>), pattern), minSupport),
+    minSupport,
+    maxHeapSize,
+    features,
+    <SPAN class="code-keyword">new</SPAN> StringOutputConvertor(<SPAN class="code-keyword">new</SPAN> SequenceFileOutputCollector&lt;Text, TopKStringPatterns&gt;(writer))
+);
+</PRE>
+</DIV></DIV>
+<UL>
+	<LI>The first argument is the iterator over the transactions, in this case an Iterator&lt;List&lt;String&gt;&gt;</LI>
+	<LI>The second argument is the output of the generateFList function, which returns the frequent items and their frequencies from the given transaction database iterator</LI>
+	<LI>The third argument is the minimum support of the patterns to be generated</LI>
+	<LI>The fourth argument is the maximum number of patterns to be mined for each feature</LI>
+	<LI>The fifth argument is the set of features for which the frequent patterns have to be mined</LI>
+	<LI>The last argument is an output collector that takes [key, value] pairs of Feature and TopK Patterns in the format [String, List&lt;Pair&lt;List&lt;String&gt;, Long&gt;&gt;] and writes them to the appropriate writer class, which takes care of storing the object, in this case in the SequenceFile output format</LI>
+</UL>
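+<P>The pruning step described above (count every feature, then drop those below minSupport) can be sketched as follows. This is only an illustration of the idea, not Mahout's generateFList: the class name FListSketch and the method frequentItems are hypothetical, and counting a feature at most once per transaction is an assumption.</P>

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: NOT the Mahout implementation of generateFList.
class FListSketch {

  // Returns (feature, frequency) pairs with frequency >= minSupport,
  // ordered by descending frequency, ties broken alphabetically.
  static List<Map.Entry<String, Long>> frequentItems(List<List<String>> transactions, long minSupport) {
    Map<String, Long> counts = new HashMap<>();
    for (List<String> transaction : transactions) {
      // Assumption: a feature counts once per transaction, even if repeated in it.
      for (String feature : new HashSet<>(transaction)) {
        counts.merge(feature, 1L, Long::sum);
      }
    }
    List<Map.Entry<String, Long>> frequent = new ArrayList<>();
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      if (e.getValue() >= minSupport) {
        frequent.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
      }
    }
    frequent.sort((a, b) -> {
      int byCount = Long.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    return frequent;
  }

  public static void main(String[] args) {
    List<List<String>> transactions = new ArrayList<>();
    transactions.add(List.of("a", "b", "c"));
    transactions.add(List.of("a", "b"));
    transactions.add(List.of("a", "d"));
    // With minSupport = 2, "c" and "d" (seen once each) are pruned.
    System.out.println(frequentItems(transactions, 2));
  }
}
```

+<P>The surviving features play the role of the N features for which the top K patterns are then mined.</P>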
+
+
+<H2><A name="ParallelFrequentPatternMining-RunningFrequentPatternGrowthviacommandline"></A>Running Frequent Pattern Growth via command line</H2>
+
+<P>The command line launcher for string transaction data, org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver, has other features, including specifying the regex pattern for splitting a string line of a transaction into its constituent features.</P>
+
+<P>Input files have to be in the following format.</P>
+
+<P>&lt;optional document id&gt;TAB&lt;TOKEN1&gt;SPACE&lt;TOKEN2&gt;SPACE....</P>
+
+<P>Instead of a tab you could use , or &#124;, as the default tokenization is done using a Java regex pattern:</P>
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">[ ,\t]*[,|\t][ ,\t]*</PRE>
+</DIV></DIV>
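+<P>To see how such a splitter pattern breaks a line into tokens, here is a small sketch using plain java.util.regex. The class TransactionSplitDemo and the sample lines are hypothetical, and the default pattern string is written out here as an assumption; FPGrowthDriver itself may process lines differently.</P>

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class: it only illustrates the regex, not FPGrowthDriver.
class TransactionSplitDemo {

  // Assumed default splitter: a comma, pipe or tab, with optional surrounding blanks.
  static final Pattern DEFAULT_SPLITTER = Pattern.compile("[ ,\\t]*[,|\\t][ ,\\t]*");

  // Splitter for space-delimited transaction lines, i.e. the regex '[\ ]'.
  static final Pattern SPACE_SPLITTER = Pattern.compile("[\\ ]");

  static String[] split(Pattern splitter, String line) {
    return splitter.split(line);
  }

  public static void main(String[] args) {
    // Comma, pipe and tab all act as separators under the default pattern.
    System.out.println(Arrays.toString(split(DEFAULT_SPLITTER, "milk, bread,eggs|tea\tjam")));
    // A space-delimited line needs the space-based pattern instead.
    System.out.println(Arrays.toString(split(SPACE_SPLITTER, "39 73 74")));
  }
}
```

+<P>Note that the default pattern does not split on a bare space, which is why space-delimited files need a different regex.</P>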
+<P>You can override this parameter to parse your log files or transaction files (each line is a transaction). The FPGrowth algorithm mines the top K frequently occurring sets of items and their counts from the given input data.</P>
+
+<P>$MAHOUT_HOME/core/src/test/resources/retail.dat is a sample dataset in this format. <BR>
+Other sample files, such as accidents.dat.gz, are available from <A href="http://fimi.cs.helsinki.fi/data/" class="external-link" rel="nofollow">http://fimi.cs.helsinki.fi/data/</A>. As a quick test, try this:</P>
+
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+bin/mahout fpg \
+     -i core/src/test/resources/retail.dat \
+     -o patterns \
+     -k 50 \
+     -method sequential \
+     -regex '[\ ]' \
+     -s 2
+</PRE>
+</DIV></DIV>
+
+<P>The minimumSupport parameter &#45;s is the minimum number of times a pattern or a feature needs to occur in the dataset for it to be included in the generated patterns. You can speed up the process by using a large value of s. There are cases where you will get fewer than k patterns for a particular feature, because the rest do not meet the minimum support criterion.</P>
+
+<P>Note that the input to the algorithm can be an uncompressed file, a compressed gz file, or even a directory containing any number of such files.<BR>
+In the command above, we changed the regex to split the tokens on spaces. Note that the input regex string is escaped.</P>
+
+<H2><A name="ParallelFrequentPatternMining-RunningParallelFPGrowth"></A>Running Parallel FPGrowth</H2>
+
+<P>Running parallel FPGrowth is as easy as changing the flag to &#45;method mapreduce and adding the number-of-groups parameter, e.g. &#45;g 20 for 20 groups. First, let's run the above sample test in map-reduce mode:</P>
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+bin/mahout fpg \
+     -i core/src/test/resources/retail.dat \
+     -o patterns \
+     -k 50 \
+     -method mapreduce \
+     -regex '[\ ]' \
+     -s 2
+</PRE>
+</DIV></DIV>
+<P>The above test took 102 seconds on a dual-core laptop, vs. 609 seconds in sequential mode (with 5 GB of RAM allocated). In a separate test, the first 1000 lines of retail.dat took 20 seconds in map/reduce vs. 30 seconds in sequential mode.</P>
+
+<P>Here is another dataset which, while several times larger, requires much less time to find frequent patterns, as there are very few. Get accidents.dat.gz from <A href="http://fimi.cs.helsinki.fi/data/" class="external-link" rel="nofollow">http://fimi.cs.helsinki.fi/data/</A> and place it on your HDFS in a folder named accidents. Then run the Hadoop version of the FPGrowth job:</P>
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+bin/mahout fpg \
+     -i accidents \
+     -o patterns \
+     -k 50 \
+     -method mapreduce \
+     -regex '[\ ]' \
+     -s 2
+</PRE>
+</DIV></DIV>
+
+<P>Or, to run a dataset of this size in sequential mode on a single machine, let's give Mahout a lot more memory and only keep features with more than 300 members:</P>
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+export MAHOUT_HEAPSIZE=-Xmx5000m
+bin/mahout fpg \
+     -i accidents \
+     -o patterns \
+     -k 50 \
+     -method sequential \
+     -regex '[\ ]' \
+     -s 2
+</PRE>
+</DIV></DIV>
+
+
+<P>The numGroups parameter &#45;g in FPGrowthJob specifies the number of groups into which transactions have to be decomposed. The default of 1000 works very well on a single-machine cluster; this may be very different on large clusters.</P>
+
+<P>Note that accidents.dat has 340 unique features, so we chose &#45;g 10 to split the transactions across 10 shards, with the patterns for 34 features mined from each shard. (Note: the number of features does not need to be exactly divisible by g.) The algorithm takes care of calculating the split. For better performance on large datasets and clusters, try not to mine for more than 20-25 features per shard. Stick to the defaults on a small machine.</P>
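+<P>The arithmetic behind the split can be sketched as below. This is an assumption about how features might be assigned to groups (ceiling division over a ranked feature list), written for illustration only; the class GroupSplitSketch and its methods are hypothetical and are not taken from FPGrowthJob.</P>

```java
// Hypothetical sketch of splitting features into groups; not Mahout's code.
class GroupSplitSketch {

  // Features per group, rounded up so every feature lands in some group
  // even when numFeatures is not exactly divisible by numGroups.
  static int maxPerGroup(int numFeatures, int numGroups) {
    return (numFeatures + numGroups - 1) / numGroups;
  }

  // Group of a feature identified by its 0-based rank in the feature list.
  static int groupOf(int featureRank, int maxPerGroup) {
    return featureRank / maxPerGroup;
  }

  public static void main(String[] args) {
    int perGroup = maxPerGroup(340, 10);
    System.out.println("features per group: " + perGroup);
    System.out.println("group of last feature: " + groupOf(339, perGroup));
  }
}
```

+<P>Under this assumption, staying at or below 20-25 features per shard for 340 features would mean choosing at least 14 groups.</P>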
+
+<P>The numTreeCacheEntries parameter &#45;tc specifies the number of generated conditional FP-Trees to be kept in memory so that subsequent operations do not need to regenerate them. Increasing this number increases memory consumption but might improve speed up to a certain point. This depends entirely on the dataset in question. A value of 5-10 is recommended for mining up to the top 100 patterns for each feature.</P>
+
+<H2><A name="ParallelFrequentPatternMining-Viewingtheresults"></A>Viewing the results</H2>
+<P>The output will be dumped to a SequenceFile in the frequentpatterns directory in Text=&gt;TopKStringPatterns format. Run this command to see a few of the Frequent Patterns:</P>
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+bin/mahout seqdumper \
+     -s patterns/frequentpatterns/part-?-00000 \
+     -n 4
+</PRE>
+</DIV></DIV>
+<P>Or replace -n 4 with -c to get the count of patterns.</P>
+
+<P>Open question: how does one experiment with and monitor these various parameters?</P>
+        </DIV>
+
+        
+      </DIV>
+    </DIV>
+    <DIV class="footer">
+      Generated by
+      <A href="http://www.atlassian.com/confluence/">Atlassian Confluence</A> (Version: 3.4.9 Build: 2042 Feb 14, 2011)
+      <A href="http://could.it/autoexport/">Auto Export Plugin</A> (Version: 1.0.0-dkulp)
+    </DIV>
+<SCRIPT type="text/javascript">
+
+  var _gaq = _gaq || [];
+  _gaq.push(['_setAccount', 'UA-17359171-1']);
+  _gaq.push(['_setDomainName', 'none']);
+  _gaq.push(['_setAllowLinker', true]);
+  _gaq.push(['_trackPageview']);
+
+  (function() {
+    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+  })();
+
+</SCRIPT>
+  </BODY>
+</HTML>
\ No newline at end of file

Added: mahout/site/new_website/MAHOUT/parallel-viterbi.html
URL: http://svn.apache.org/viewvc/mahout/site/new_website/MAHOUT/parallel-viterbi.html?rev=1243022&view=auto
==============================================================================
--- mahout/site/new_website/MAHOUT/parallel-viterbi.html (added)
+++ mahout/site/new_website/MAHOUT/parallel-viterbi.html Sat Feb 11 10:22:15 2012
@@ -0,0 +1,184 @@
+
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<HTML>
+  <HEAD>
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/space.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/master.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/wiki-content.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/abs.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/menu.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/menu-ie.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/tables.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/panels.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/master-ie.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/renderer-macros.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/content-types.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/login.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/information-macros.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/layout-macros.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/default-theme.css">
+    <LINK type="text/css" rel="stylesheet" href="resources/space.css">
+    <STYLE type="text/css">
+      .footer {
+        background-image:      url('https://cwiki.apache.org/confluence/images/border/border_bottom.gif');
+        background-repeat:     repeat-x;
+        background-position:   left top;
+        padding-top:           4px;
+        color:                 #666;
+      }
+    </STYLE>
+    <SCRIPT type="text/javascript" language="javascript">
+      var hide = null;
+      var show = null;
+      var children = null;
+
+      function init() {
+        /* Search form initialization */
+        var form = document.forms['search'];
+        if (form != null) {
+          form.elements['domains'].value = location.hostname;
+          form.elements['sitesearch'].value = location.hostname;
+        }
+
+        /* Children initialization */
+        hide = document.getElementById('hide');
+        show = document.getElementById('show');
+        children = document.all != null ?
+                   document.all['children'] :
+                   document.getElementById('children');
+        if (children != null) {
+          children.style.display = 'none';
+          show.style.display = 'inline';
+          hide.style.display = 'none';
+        }
+      }
+
+      function showChildren() {
+        children.style.display = 'block';
+        show.style.display = 'none';
+        hide.style.display = 'inline';
+      }
+
+      function hideChildren() {
+        children.style.display = 'none';
+        show.style.display = 'inline';
+        hide.style.display = 'none';
+      }
+    </SCRIPT>
+    <TITLE>Parallel Viterbi</TITLE>
+  <META http-equiv="Content-Type" content="text/html;charset=UTF-8"></HEAD>
+  <BODY onload="init()">
+    <TABLE border="0" cellpadding="2" cellspacing="0" width="100%">
+      <TR class="topBar">
+        <TD align="left" valign="middle" class="topBarDiv" align="left" nowrap="">
+          &nbsp;<A href="mahout-wiki.html" title="Apache Mahout">Apache Mahout</A>&nbsp;&gt;&nbsp;<A href="mahout-wiki.html" title="Mahout Wiki">Mahout Wiki</A>&nbsp;&gt;&nbsp;<A href="algorithms.html" title="Algorithms">Algorithms</A>&nbsp;&gt;&nbsp;<A href="" title="Parallel Viterbi">Parallel Viterbi</A>
+        </TD>
+        <TD align="right" valign="middle" nowrap="">
+          <FORM name="search" action="http://www.google.com/search" method="get">
+            <INPUT type="hidden" name="ie" value="UTF-8">
+            <INPUT type="hidden" name="oe" value="UTF-8">
+            <INPUT type="hidden" name="domains" value="">
+            <INPUT type="hidden" name="sitesearch" value="">
+            <INPUT type="text" name="q" maxlength="255" value="">        
+            <INPUT type="submit" name="btnG" value="Google Search">
+          </FORM>
+        </TD>
+      </TR> 
+    </TABLE>
+
+    <DIV id="PageContent">
+      <DIV class="pageheader" style="padding: 6px 0px 0px 0px;">
+        <!-- We'll enable this once we figure out how to access (and save) the logo resource -->
+        <!--img src="/wiki/images/confluence_logo.gif" style="float: left; margin: 4px 4px 4px 10px;" border="0"-->
+        <DIV style="margin: 0px 10px 0px 10px" class="smalltext">Apache Mahout</DIV>
+        <DIV style="margin: 0px 10px 8px 10px" class="pagetitle">Parallel Viterbi</DIV>
+
+        <DIV class="greynavbar" align="right" style="padding: 2px 10px; margin: 0px;">
+          <A href="https://cwiki.apache.org/confluence/pages/editpage.action?pageId=27826542">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/notep_16.gif" height="16" width="16" border="0" align="absmiddle" title="Edit Page"></A>
+            <A href="https://cwiki.apache.org/confluence/pages/editpage.action?pageId=27826542">Edit Page</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/browse_space.gif" height="16" width="16" border="0" align="absmiddle" title="Browse Space"></A>
+            <A href="https://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">Browse Space</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=27826542">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/add_page_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add Page"></A>
+          <A href="https://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=27826542">Add Page</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=27826542">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/add_blogentry_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add News"></A>
+          <A href="https://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=27826542">Add News</A>
+        </DIV>
+      </DIV>
+
+      <DIV class="pagecontent">
+        <DIV class="wiki-content">
+          <P>The Viterbi algorithm is an inference algorithm (also known as segmentation or decoding) for Hidden Markov Models [1] that finds the most likely sequence of hidden states given a sequence of observed states.</P>
+
+<P>Apache Mahout has both <A href="hidden-markov-models.html" title="Hidden Markov Models">sequential</A> and parallel (that's what you're reading about) implementations of the algorithm.</P>
+
+<P>A detailed presentation about the parallel Viterbi implementation can be found <A href="http://modis.ispras.ru/seminar/wp-content/uploads/2011/11/Mahout_Viterbi.pdf" class="external-link" rel="nofollow">here</A> (in Russian).</P>
+
+<H3><A name="ParallelViterbi-Parallelizationstrategy"></A>Parallelization strategy</H3>
+
+<P>The parallelization strategy is quite straightforward and based on data parallelism. There are some studies on parallelizing Viterbi (and Belief Propagation, which is an inference algorithm for loop-free Markov Random Fields and is quite similar to Viterbi), but at the time of writing none of them seems to be applicable to the MapReduce paradigm.</P>
+
+<P>For example, the forward pass of Viterbi can be expressed in terms of matrix computations (being a dynamic programming algorithm) and thus parallelized in principle, but the MapReduce overhead would outweigh the gain from parallel matrix multiplication.</P>
+
+<P>Input sequences of observed variables are expected to be divided into chunks of some length, short enough that the O(N*K) data for a chunk fits in main memory. The set of all chunks with number N is called &quot;series number N&quot;. The algorithm processes the data from series number N-1 to series number N (or vice versa), performing the forward and backward Viterbi passes independently for each chunk (and consequently for each sequence) in the reducers. Only the data necessary for computing the next series is emitted through the direct output of the reducers; all other data is collected in the background. For example, when performing the forward Viterbi pass, only the probabilities of the last hidden state are necessary for the next step; the backpointer tables can be written in parallel to local storage, since they are needed only for the backward pass.</P>
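+<P>To make the two passes concrete, here is a minimal single-chunk Viterbi sketch in plain Java. It is not the Mahout implementation: the class ViterbiSketch, the method decode and the toy model in main are all hypothetical. It does show the property used above: the forward pass only carries one column of scores forward, while the backpointer table is read only in the backward pass.</P>

```java
import java.util.Arrays;

// Hypothetical single-chunk sketch; not Mahout's parallel Viterbi code.
class ViterbiSketch {

  // start[s]: initial probability of state s; trans[p][s]: probability of p -> s;
  // emit[s][o]: probability of observing o in state s; obs: observed symbols.
  static int[] decode(double[] start, double[][] trans, double[][] emit, int[] obs) {
    int n = obs.length;
    int k = start.length;
    int[][] back = new int[n][k];   // backpointers, needed only for the backward pass
    double[] prev = new double[k];  // log-scores of the previous step, all the forward pass keeps
    for (int s = 0; s < k; s++) {
      prev[s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
    }
    // Forward pass: best log-score of any path ending in each state at step t.
    for (int t = 1; t < n; t++) {
      double[] cur = new double[k];
      for (int s = 0; s < k; s++) {
        int best = 0;
        double bestScore = prev[0] + Math.log(trans[0][s]);
        for (int p = 1; p < k; p++) {
          double score = prev[p] + Math.log(trans[p][s]);
          if (score > bestScore) {
            bestScore = score;
            best = p;
          }
        }
        cur[s] = bestScore + Math.log(emit[s][obs[t]]);
        back[t][s] = best;
      }
      prev = cur;
    }
    // Backward pass: start from the best final state and follow the backpointers.
    int last = 0;
    for (int s = 1; s < k; s++) {
      if (prev[s] > prev[last]) {
        last = s;
      }
    }
    int[] path = new int[n];
    path[n - 1] = last;
    for (int t = n - 1; t > 0; t--) {
      path[t - 1] = back[t][path[t]];
    }
    return path;
  }

  public static void main(String[] args) {
    // Toy two-state model with three observation symbols (illustrative values).
    double[] start = {0.6, 0.4};
    double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
    double[][] emit = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};
    System.out.println(Arrays.toString(decode(start, trans, emit, new int[] {0, 1, 2})));
  }
}
```

+<P>In a chunked setting, the final prev column is what a following chunk would need from this one, while back can stay in local storage until the backward pass.</P>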
+
+<P>If all the sequences are of approximately the same length and the number of sequences to decode is much greater than the number of reducers, O(N*M/K) time is required to decode them in parallel (N is the length of each sequence, M is the number of sequences, K is the number of reducers).</P>
+
+<H3><A name="ParallelViterbi-Dataformat"></A>Data format</H3>
+
+<P>Each sequence of observed states must be stored in sequence files, where the key is the name of the sequence and the value is&nbsp;an ObservedSequenceWritable, in which the chunk number, the data length and the data itself are stored. At the moment this is a hardcoded requirement, but it should be easy to implement support for any input file format that provides this information.</P>
+
+<P>The easiest way to convert plain text files containing space-delimited numbers of observed states to this format is to use &quot;bin/mahout hmmchunks&quot;.</P>
+
+<P>After parallel Viterbi has finished, the decoded sequences are stored in sequence files, one for each chunk (the key is the chunk number, the value is a HiddenSequenceWritable). They can be unchunked to plain text files of space-delimited hidden state numbers with&nbsp;&quot;bin/mahout hmmchunks &#45;unchunk&quot;.</P>
+
+<H3><A name="ParallelViterbi-Usage"></A>Usage</H3>
+
+<P>Run &quot;bin/mahout pviterbi&quot; to see the arguments it expects. These are:&nbsp;</P>
+
+<UL>
+	<LI>a serialized HmmModel (e.g. written by the LossyHmmModelSerializer class)</LI>
+	<LI>the input data (observed sequences) in the format described above</LI>
+	<LI>paths for temporary storage (e.g. for backpointers) and for the decoded sequences</LI>
+</UL>
+
+
+<P><B>References</B></P>
+
+<OL>
+	<LI><A href="http://en.wikipedia.org/wiki/Viterbi_algorithm" class="external-link" rel="nofollow">Wikipedia article</A></LI>
+</OL>
+
+        </DIV>
+
+        
+      </DIV>
+    </DIV>
+    <DIV class="footer">
+      Generated by
+      <A href="http://www.atlassian.com/confluence/">Atlassian Confluence</A> (Version: 3.4.9 Build: 2042 Feb 14, 2011)
+      <A href="http://could.it/autoexport/">Auto Export Plugin</A> (Version: 1.0.0-dkulp)
+    </DIV>
+<SCRIPT type="text/javascript">
+
+  var _gaq = _gaq || [];
+  _gaq.push(['_setAccount', 'UA-17359171-1']);
+  _gaq.push(['_setDomainName', 'none']);
+  _gaq.push(['_setAllowLinker', true]);
+  _gaq.push(['_trackPageview']);
+
+  (function() {
+    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+  })();
+
+</SCRIPT>
+  </BODY>
+</HTML>
\ No newline at end of file

Added: mahout/site/new_website/MAHOUT/parallelfrequentpatternmining.html
URL: http://svn.apache.org/viewvc/mahout/site/new_website/MAHOUT/parallelfrequentpatternmining.html?rev=1243022&view=auto
==============================================================================
--- mahout/site/new_website/MAHOUT/parallelfrequentpatternmining.html (added)
+++ mahout/site/new_website/MAHOUT/parallelfrequentpatternmining.html Sat Feb 11 10:22:15 2012
@@ -0,0 +1,211 @@
+
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<HTML>
+  <HEAD>
+    <LINK type="text/css" rel="stylesheet" href="resources/space.css">
+    <STYLE type="text/css">
+      .footer {
+        background-image:      url('http://cwiki.apache.org/confluence/images/border/border_bottom.gif');
+        background-repeat:     repeat-x;
+        background-position:   left top;
+        padding-top:           4px;
+        color:                 #666;
+      }
+    </STYLE>
+    <SCRIPT type="text/javascript" language="javascript">
+      var hide = null;
+      var show = null;
+      var children = null;
+
+      function init() {
+        /* Search form initialization */
+        var form = document.forms['search'];
+        if (form != null) {
+          form.elements['domains'].value = location.hostname;
+          form.elements['sitesearch'].value = location.hostname;
+        }
+
+        /* Children initialization */
+        hide = document.getElementById('hide');
+        show = document.getElementById('show');
+        children = document.all != null ?
+                   document.all['children'] :
+                   document.getElementById('children');
+        if (children != null) {
+          children.style.display = 'none';
+          show.style.display = 'inline';
+          hide.style.display = 'none';
+        }
+      }
+
+      function showChildren() {
+        children.style.display = 'block';
+        show.style.display = 'none';
+        hide.style.display = 'inline';
+      }
+
+      function hideChildren() {
+        children.style.display = 'none';
+        show.style.display = 'inline';
+        hide.style.display = 'none';
+      }
+    </SCRIPT>
+    <TITLE>ParallelFrequentPatternMining</TITLE>
+  <META http-equiv="Content-Type" content="text/html;charset=UTF-8"></HEAD>
+  <BODY onload="init()">
+    <TABLE border="0" cellpadding="2" cellspacing="0" width="100%">
+      <TR class="topBar">
+        <TD align="left" valign="middle" class="topBarDiv" align="left" nowrap="">
+          &nbsp;<A href="index.html" title="Apache Lucene Mahout">Apache Lucene Mahout</A>&nbsp;&gt;&nbsp;<A href="index.html" title="index">index</A>&nbsp;&gt;&nbsp;<A href="quickstart.html" title="QuickStart">QuickStart</A>&nbsp;&gt;&nbsp;<A href="" title="ParallelFrequentPatternMining">ParallelFrequentPatternMining</A>
+        </TD>
+        <TD align="right" valign="middle" nowrap="">
+          <FORM name="search" action="http://www.google.com/search" method="get">
+            <INPUT type="hidden" name="ie" value="UTF-8">
+            <INPUT type="hidden" name="oe" value="UTF-8">
+            <INPUT type="hidden" name="domains" value="">
+            <INPUT type="hidden" name="sitesearch" value="">
+            <INPUT type="text" name="q" maxlength="255" value="">        
+            <INPUT type="submit" name="btnG" value="Google Search">
+          </FORM>
+        </TD>
+      </TR> 
+    </TABLE>
+
+    <DIV id="PageContent">
+      <DIV class="pageheader" style="padding: 6px 0px 0px 0px;">
+        <!-- We'll enable this once we figure out how to access (and save) the logo resource -->
+        <!--img src="/wiki/images/confluence_logo.gif" style="float: left; margin: 4px 4px 4px 10px;" border="0"-->
+        <DIV style="margin: 0px 10px 0px 10px" class="smalltext">Apache Lucene Mahout</DIV>
+        <DIV style="margin: 0px 10px 8px 10px" class="pagetitle">ParallelFrequentPatternMining</DIV>
+
+        <DIV class="greynavbar" align="right" style="padding: 2px 10px; margin: 0px;">
+          <A href="http://cwiki.apache.org/confluence/pages/editpage.action?pageId=13762823">
+            <IMG src="http://cwiki.apache.org/confluence/images/icons/notep_16.gif" height="16" width="16" border="0" align="absmiddle" title="Edit Page"></A>
+            <A href="http://cwiki.apache.org/confluence/pages/editpage.action?pageId=13762823">Edit Page</A>
+          &nbsp;
+          <A href="http://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">
+            <IMG src="http://cwiki.apache.org/confluence/images/icons/browse_space.gif" height="16" width="16" border="0" align="absmiddle" title="Browse Space"></A>
+            <A href="http://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">Browse Space</A>
+          &nbsp;
+          <A href="http://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=13762823">
+            <IMG src="http://cwiki.apache.org/confluence/images/icons/add_page_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add Page"></A>
+          <A href="http://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=13762823">Add Page</A>
+          &nbsp;
+          <A href="http://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=13762823">
+            <IMG src="http://cwiki.apache.org/confluence/images/icons/add_blogentry_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add News"></A>
+          <A href="http://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=13762823">Add News</A>
+        </DIV>
+      </DIV>
+      <DIV class="pagesubheading" style="margin: 0px 10px 0px 10px;">
+        #editReport()
+      </DIV>
+
+      <DIV class="pagecontent">
+        <DIV class="wiki-content">
+          <H1><A name="ParallelFrequentPatternMining-FrequentPatternMining"></A>Frequent Pattern Mining</H1>
+<P>Mahout has a Top K Parallel FPGrowth implementation. It is based on the paper <A href="http://infolab.stanford.edu/~echang/recsys08-69.pdf" class="external-link" rel="nofollow">http://infolab.stanford.edu/~echang/recsys08-69.pdf</A>, with some optimisations in mining the data.</P>
+
+<P>Given a huge transaction list, the algorithm finds all unique features (field values) and eliminates those whose frequency in the whole dataset is less than minSupport. For the N remaining features, it finds the top K closed patterns for each, generating up to NxK patterns in total. The FPGrowth algorithm is a generic implementation, so any Object type can denote a feature, but the current implementation requires you to use a String as the object type. You may implement a version for any other object type by creating Iterators, Convertors and a TopKPatternWritable for that type. For more information, refer to the package org.apache.mahout.fpm.pfpgrowth.convertors.string.</P>
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+// Example:
+ FPGrowth&lt;<SPAN class="code-object">String</SPAN>&gt; fp = <SPAN class="code-keyword">new</SPAN> FPGrowth&lt;<SPAN class="code-object">String</SPAN>&gt;();
+ Set&lt;<SPAN class="code-object">String</SPAN>&gt; features = <SPAN class="code-keyword">new</SPAN> HashSet&lt;<SPAN class="code-object">String</SPAN>&gt;();
+ fp.generateTopKStringFrequentPatterns(
+     <SPAN class="code-keyword">new</SPAN> StringRecordIterator(<SPAN class="code-keyword">new</SPAN> FileLineIterable(<SPAN class="code-keyword">new</SPAN> File(input), encoding, <SPAN class="code-keyword">false</SPAN>), pattern), 
+        fp.generateFList(
+          <SPAN class="code-keyword">new</SPAN> StringRecordIterator(<SPAN class="code-keyword">new</SPAN> FileLineIterable(<SPAN class="code-keyword">new</SPAN> File(input), encoding, <SPAN class="code-keyword">false</SPAN>), pattern), minSupport),
+         minSupport,
+        maxHeapSize, 
+        features,
+        <SPAN class="code-keyword">new</SPAN> StringOutputConvertor(<SPAN class="code-keyword">new</SPAN> SequenceFileOutputCollector&lt;Text, TopKStringPatterns&gt;(writer))
+  );
+</PRE>
+</DIV></DIV>
+<UL>
+	<LI>The first argument is the iterator over transactions, in this case an Iterator&lt;List&lt;String&gt;&gt;</LI>
+	<LI>The second argument is the output of the generateFList function, which returns the frequent items and their frequencies from the given transaction iterator</LI>
+	<LI>The third argument is the minimum support of the patterns to be generated</LI>
+	<LI>The fourth argument is the maximum number of patterns to be mined for each feature</LI>
+	<LI>The fifth argument is the set of features for which the frequent patterns have to be mined</LI>
+	<LI>The last argument is an output collector which takes [key, value] pairs of feature and top K patterns, in the format [String, List&lt;Pair&lt;List&lt;String&gt;, Long&gt;&gt;], and passes them to the appropriate writer class, which takes care of storing the object, in this case in SequenceFile output format</LI>
+</UL>
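+<P>As a rough illustration of the minSupport filtering described above (the idea behind generateFList, not Mahout's implementation), the following shell sketch counts how often each item occurs in a tiny transaction file and keeps only the items that reach the threshold. The file name tx.dat, its contents, and the variable names are made up for the example, and it assumes no item is repeated within a transaction.</P>

```shell
# Hypothetical sample: one transaction per line, items separated by spaces.
workdir=$(mktemp -d)
printf '%s\n' 'milk bread' 'milk eggs' 'bread milk butter' > "$workdir/tx.dat"

min_support=2

# Split items onto separate lines, count occurrences, and keep the frequent ones.
flist=$(cat "$workdir/tx.dat" | tr ' ' '\n' | LC_ALL=C sort | uniq -c |
  awk -v s="$min_support" '$1 >= s { print $2 ":" $1 }')

echo "$flist"
```

+<P>Raising min_support shrinks this list, which is why a larger -s value reduces the work done in the later mining steps.</P>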
+
+
+<H2><A name="ParallelFrequentPatternMining-RunningFrequentPatternGrowthviacommandline"></A>Running Frequent Pattern Growth via command line</H2>
+
+<P>The command line launcher for string transaction data, org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver, has other features, including specifying the regex pattern for splitting a string line of a transaction into its constituent features.</P>
+
+<P>Input files have to be in the following format.</P>
+
+<P>&lt;optional document id&gt;TAB&lt;TOKEN1&gt;SPACE&lt;TOKEN2&gt;SPACE.....<BR>
+Instead of a tab you could use , or | because the default tokenization is done using the Java regex pattern [ ,\t]*[,|\t][ ,\t]*<BR>
+You can override this parameter to parse your log files or transaction files.</P>
+
+<P>Each line is a transaction. The FPGrowth algorithm mines the top K most frequently occurring sets of items and their counts from the given input data.<BR>
+To run the example, pick a dataset from <A href="http://fimi.cs.helsinki.fi/data/" class="external-link" rel="nofollow">http://fimi.cs.helsinki.fi/data/</A>, say accidents.dat.gz, and download it to your mahout folder.</P>
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+mvn -e  exec:java   -Dexec.mainClass=org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver \
+     -Dexec.args=&quot;-i accidents.dat.gz \
+     -o patterns \
+     -k 50 \
+     -method sequential \
+     -regex [\ ] \
+     -s 2&quot;
+</PRE>
+</DIV></DIV>
+
+<P>The minimum support parameter is the minimum number of times a pattern or a feature needs to occur in the dataset for it to be included in the generated patterns. You can speed up the process by using a larger value of s. There are cases where you will get fewer than k patterns for a particular feature because the rest do not meet the minimum support criterion.</P>
+
+<P>Note that the input to the algorithm can be an uncompressed file, a gz-compressed file, or even a directory containing any number of such files.<BR>
+We modified the regex to split the tokens on spaces. Note that the input regex string is escaped.<BR>
+The output will be dumped to a SequenceFile in the frequentpatterns directory in Text=&gt;TopKStringPatterns format. You can use the &quot;bin/mahout seqdumper&quot; command to inspect the output file.</P>
+
+<H2><A name="ParallelFrequentPatternMining-RunningParallelFPGrowth"></A>Running Parallel FPGrowth </H2>
+
+<P>Running parallel FPGrowth is as easy as changing the flag to -method mapreduce and adding the number-of-groups parameter, e.g. -g 20 for 20 groups. Put accidents.dat.gz on HDFS in a folder named accidents.</P>
+
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+# To run on a Hadoop cluster
+
+&lt;HADOOP-BIN&gt;/hadoop jar mahout-examples-xx.job org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver \
+     -i accidents \
+     -o patterns \
+     -k 50 \
+     -method mapreduce \
+     -g 10 \
+     -regex [\ ] \
+     -s 2
+
+# OR, to demo the algorithm on a single machine
+
+bin/mahout fpg \
+     -i accidents \
+     -o patterns \
+     -k 50 \
+     -method mapreduce \
+     -g 10 \
+     -regex [\ ] \
+     -s 2
+</PRE>
+</DIV></DIV>
+
+<P>Note that the accidents dataset has 340 unique features, so we chose -g 10 to split the transactions across 10 shards, with patterns for 34 features mined from each shard. The number of features does not need to be exactly divisible by g; the algorithm takes care of calculating the split. For better performance on large datasets, try not to mine more than 20-25 features per shard, and adjust the number of groups accordingly.</P>
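+<P>The group arithmetic above can be checked with a short shell sketch. The numbers 340 and 10 come from the accidents example, and 25 is the upper end of the per-shard guideline. The variable names are only illustrative, and the rounding-up division is an assumption about how an uneven split would be handled, not a description of Mahout's code.</P>

```shell
features=340
groups=10
max_per_shard=25

# Features handled per shard, rounded up so that no feature is left out.
per_shard=$(( (features + groups - 1) / groups ))

# Smallest number of groups that keeps every shard at or below the limit.
suggested_groups=$(( (features + max_per_shard - 1) / max_per_shard ))

echo "features per shard with -g $groups: $per_shard"
echo "suggested -g for at most $max_per_shard features per shard: $suggested_groups"
```

+<P>Under these assumptions, -g 10 gives 34 features per shard, and a value of around -g 14 would be needed to stay within the 25-feature guideline.</P>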
+
+<P>The numGroups parameter in FPGrowthJob specifies the number of groups into which the transactions have to be decomposed.<BR>
+The numTreeCacheEntries parameter specifies the number of generated conditional FP-Trees to be kept in memory so that subsequent operations do not need to regenerate them. Increasing this number increases memory consumption but might improve speed up to a certain point, which depends entirely on the dataset in question. A value of 5-10 is recommended for mining up to the top 100 patterns for each feature.</P>
+
+        </DIV>
+
+        
+      </DIV>
+    </DIV>
+    <DIV class="footer">
+      Generated by
+      <A href="http://www.atlassian.com/confluence/">Atlassian Confluence</A> (Version: 3.2 Build: 1810 Mar 16, 2010)
+      <A href="http://could.it/autoexport/">Auto Export Plugin</A> (Version: 1.0.0-dkulp)
+    </DIV>
+  </BODY>
+</HTML>
\ No newline at end of file

Added: mahout/site/new_website/MAHOUT/partial-implementation.html
URL: http://svn.apache.org/viewvc/mahout/site/new_website/MAHOUT/partial-implementation.html?rev=1243022&view=auto
==============================================================================
--- mahout/site/new_website/MAHOUT/partial-implementation.html (added)
+++ mahout/site/new_website/MAHOUT/partial-implementation.html Sat Feb 11 10:22:15 2012
@@ -0,0 +1,248 @@
+
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<HTML>
+  <HEAD>
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/space.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/master.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/wiki-content.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/abs.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/menu.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/menu-ie.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/tables.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/panels.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/master-ie.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/renderer-macros.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/content-types.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/login.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/information-macros.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/layout-macros.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/default-theme.css">
+    <LINK type="text/css" rel="stylesheet" href="resources/space.css">
+    <STYLE type="text/css">
+      .footer {
+        background-image:      url('https://cwiki.apache.org/confluence/images/border/border_bottom.gif');
+        background-repeat:     repeat-x;
+        background-position:   left top;
+        padding-top:           4px;
+        color:                 #666;
+      }
+    </STYLE>
+    <SCRIPT type="text/javascript" language="javascript">
+      var hide = null;
+      var show = null;
+      var children = null;
+
+      function init() {
+        /* Search form initialization */
+        var form = document.forms['search'];
+        if (form != null) {
+          form.elements['domains'].value = location.hostname;
+          form.elements['sitesearch'].value = location.hostname;
+        }
+
+        /* Children initialization */
+        hide = document.getElementById('hide');
+        show = document.getElementById('show');
+        children = document.all != null ?
+                   document.all['children'] :
+                   document.getElementById('children');
+        if (children != null) {
+          children.style.display = 'none';
+          show.style.display = 'inline';
+          hide.style.display = 'none';
+        }
+      }
+
+      function showChildren() {
+        children.style.display = 'block';
+        show.style.display = 'none';
+        hide.style.display = 'inline';
+      }
+
+      function hideChildren() {
+        children.style.display = 'none';
+        show.style.display = 'inline';
+        hide.style.display = 'none';
+      }
+    </SCRIPT>
+    <TITLE>Partial Implementation</TITLE>
+  <META http-equiv="Content-Type" content="text/html;charset=UTF-8"></HEAD>
+  <BODY onload="init()">
+    <TABLE border="0" cellpadding="2" cellspacing="0" width="100%">
+      <TR class="topBar">
+        <TD align="left" valign="middle" class="topBarDiv" align="left" nowrap="">
+          &nbsp;<A href="mahout-wiki.html" title="Apache Mahout">Apache Mahout</A>&nbsp;&gt;&nbsp;<A href="" title="Partial Implementation">Partial Implementation</A>
+        </TD>
+        <TD align="right" valign="middle" nowrap="">
+          <FORM name="search" action="http://www.google.com/search" method="get">
+            <INPUT type="hidden" name="ie" value="UTF-8">
+            <INPUT type="hidden" name="oe" value="UTF-8">
+            <INPUT type="hidden" name="domains" value="">
+            <INPUT type="hidden" name="sitesearch" value="">
+            <INPUT type="text" name="q" maxlength="255" value="">        
+            <INPUT type="submit" name="btnG" value="Google Search">
+          </FORM>
+        </TD>
+      </TR> 
+    </TABLE>
+
+    <DIV id="PageContent">
+      <DIV class="pageheader" style="padding: 6px 0px 0px 0px;">
+        <!-- We'll enable this once we figure out how to access (and save) the logo resource -->
+        <!--img src="/wiki/images/confluence_logo.gif" style="float: left; margin: 4px 4px 4px 10px;" border="0"-->
+        <DIV style="margin: 0px 10px 0px 10px" class="smalltext">Apache Mahout</DIV>
+        <DIV style="margin: 0px 10px 8px 10px" class="pagetitle">Partial Implementation</DIV>
+
+        <DIV class="greynavbar" align="right" style="padding: 2px 10px; margin: 0px;">
+          <A href="https://cwiki.apache.org/confluence/pages/editpage.action?pageId=10846295">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/notep_16.gif" height="16" width="16" border="0" align="absmiddle" title="Edit Page"></A>
+            <A href="https://cwiki.apache.org/confluence/pages/editpage.action?pageId=10846295">Edit Page</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/browse_space.gif" height="16" width="16" border="0" align="absmiddle" title="Browse Space"></A>
+            <A href="https://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">Browse Space</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=10846295">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/add_page_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add Page"></A>
+          <A href="https://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=10846295">Add Page</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=10846295">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/add_blogentry_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add News"></A>
+          <A href="https://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=10846295">Add News</A>
+        </DIV>
+      </DIV>
+
+      <DIV class="pagecontent">
+        <DIV class="wiki-content">
+          <H1><A name="PartialImplementation-Introduction"></A>Introduction</H1>
+
+<P>This quick start page shows how to build a decision forest using the partial implementation. This tutorial also explains how to use the decision forest to classify new data.<BR>
+Partial Decision Forests is a mapreduce implementation where each mapper builds a subset of the forest using only the data available in its partition. This allows building forests using large datasets as long as each partition can be loaded in-memory.</P>
+
+<H1><A name="PartialImplementation-Steps"></A>Steps</H1>
+<H2><A name="PartialImplementation-Downloadthedata"></A>Download the data</H2>
+<UL>
+	<LI>The current implementation is compatible with the UCI repository file format. In this example we'll use the NSL-KDD dataset because it's large enough to show the performance of the partial implementation.<BR>
+You can download the dataset here <A href="http://nsl.cs.unb.ca/NSL-KDD/" class="external-link" rel="nofollow">http://nsl.cs.unb.ca/NSL-KDD/</A><BR>
+You can either download the full training set &quot;KDDTrain+.ARFF&quot;, or a 20% subset &quot;KDDTrain+_20Percent.ARFF&quot; (we'll use the full dataset in this tutorial) and the test set &quot;KDDTest+.ARFF&quot;.</LI>
+	<LI>Open the train and test files and remove all the lines that begin with '@'. All those lines are at the top of the files. You may want to keep a copy of those lines somewhere, because they will help us describe the dataset to Mahout.</LI>
+	<LI>Put the data in HDFS: 
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+$HADOOP_HOME/bin/hadoop fs -mkdir testdata
+$HADOOP_HOME/bin/hadoop fs -put &lt;PATH TO DATA&gt; testdata</PRE>
+</DIV></DIV></LI>
+</UL>
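+<P>The header-stripping step above can also be done from the command line instead of in an editor. The sketch below is one possible way, shown on a tiny made-up file; sample.arff, sample.header and sample.data are hypothetical names, and on the real dataset you would point the two grep commands at the downloaded train and test files.</P>

```shell
# Hypothetical ARFF-like sample standing in for the real training file.
workdir=$(mktemp -d)
printf '%s\n' '@relation demo' '@attribute duration real' '@data' \
  '0,tcp,normal' '5,udp,anomaly' > "$workdir/sample.arff"

# Save the '@' header lines for later reference.
grep '^@' "$workdir/sample.arff" > "$workdir/sample.header"

# Write a data-only copy without the '@' lines.
grep -v '^@' "$workdir/sample.arff" > "$workdir/sample.data"

header_lines=$(grep -c '' "$workdir/sample.header")
data_lines=$(grep -c '' "$workdir/sample.data")
echo "header lines: $header_lines, data lines: $data_lines"
```

+<P>Keeping the header in its own file preserves the attribute declarations, which are useful when writing the descriptor string for the dataset.</P>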
+
+
+<H2><A name="PartialImplementation-BuildtheJobfiles"></A>Build the Job files</H2>
+<UL>
+	<LI>In $MAHOUT_HOME/ run: 
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">mvn clean install -DskipTests</PRE>
+</DIV></DIV></LI>
+</UL>
+
+
+<H2><A name="PartialImplementation-Generateafiledescriptorforthedataset%3A"></A>Generate a file descriptor for the dataset: </H2>
+<P>run the following command:</P>
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-&lt;VERSION&gt;-job.jar org.apache.mahout.df.tools.Describe -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
+</PRE>
+</DIV></DIV>
+<P>The &quot;N 3 C 2 N C 4 N C 8 N 2 C 19 N L&quot; string describes all the attributes of the data. In this case, it means 1 numerical (N) attribute, followed by 3 categorical (C) attributes, and so on; L indicates the label. You can also use 'I' to ignore some attributes.</P>
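+<P>To make the descriptor notation concrete, the following shell sketch expands the string into one letter per attribute and counts them. It only illustrates the notation as explained above (a number repeats the type that follows it); it is not Mahout's parser, and the variable names are invented for the example.</P>

```shell
descriptor="N 3 C 2 N C 4 N C 8 N 2 C 19 N L"

count=1      # repeat count for the next type token; defaults to 1
expanded=""  # one letter per attribute, in order
total=0

for token in $descriptor; do
  case "$token" in
    [0-9]*)
      # A number sets how many times the following type is repeated.
      count=$token
      ;;
    *)
      i=0
      while [ "$i" -lt "$count" ]; do
        expanded="$expanded$token"
        i=$((i + 1))
      done
      total=$((total + count))
      count=1
      ;;
  esac
done

echo "$total attributes: $expanded"
```

+<P>The total should match the number of columns in each data line, including the label column.</P>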
+
+<H2><A name="PartialImplementation-Runtheexample"></A>Run the example</H2>
+
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-&lt;version&gt;-job.jar org.apache.mahout.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -oob -d testdata/KDDTrain+.arff -ds testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest
+</PRE>
+</DIV></DIV>
+<P>This builds 100 trees (-t argument) using the partial implementation (-p). Each tree is built using 5 randomly selected attributes per node (-sl argument). The example computes the out-of-bag error (-oob) and outputs the decision forest in the &quot;nsl-forest&quot; directory (-o).<BR>
+The number of partitions is controlled by the -Dmapred.max.split.size argument, which tells Hadoop the maximum size of each partition, in this case 1/10 of the size of the dataset. Thus 10 partitions will be used.<BR>
+IMPORTANT: using fewer partitions should give better classification results, but needs a lot of memory. So if the jobs are failing, try increasing the number of partitions.</P>
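+<P>One way to derive a split size for a desired number of partitions is to divide the file size by that number. The sketch below does this on a 1000-byte stand-in file; train.data and the variable names are hypothetical, and the actual number of splits Hadoop creates can also depend on other settings such as the block size, so treat the result as a starting point.</P>

```shell
# Hypothetical stand-in for the training file: 1000 bytes of filler.
workdir=$(mktemp -d)
head -c 1000 /dev/zero > "$workdir/train.data"

partitions=10

# File size in bytes; awk strips the padding some wc implementations add.
size=$(wc -c "$workdir/train.data" | awk '{ print $1 }')

# Round up so that no more than the requested number of splits is needed.
split_size=$(( (size + partitions - 1) / partitions ))

echo "-Dmapred.max.split.size=$split_size"
```

+<P>Choosing a larger partitions value yields a smaller split size, which lowers the memory needed by each mapper.</P>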
+<UL>
+	<LI>The example outputs the Build Time and the oob error estimation</LI>
+</UL>
+
+
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+10/03/13 17:57:29 INFO mapreduce.BuildForest: Build Time: 0h 7m 43s 582
+10/03/13 17:57:33 INFO mapreduce.BuildForest: oob error estimate : 0.002325895231517865
+10/03/13 17:57:33 INFO mapreduce.BuildForest: Storing the forest in: nsl-forest/forest.seq
+</PRE>
+</DIV></DIV>
+
+<H2><A name="PartialImplementation-UsingtheDecisionForesttoClassifynewdata"></A>Using the Decision Forest to Classify new data</H2>
+<P>run the following command:</P>
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-&lt;version&gt;-job.jar org.apache.mahout.df.mapreduce.TestForest -i testdata/KDDTest+.arff -ds testdata/KDDTrain+.info -m nsl-forest -a -mr -o predictions
+</PRE>
+</DIV></DIV>
+<P>This will compute the predictions for the &quot;KDDTest+.arff&quot; dataset (-i argument) using the same data descriptor generated for the training dataset (-ds) and the decision forest built previously (-m). Optionally, if the test dataset contains the labels of the tuples, you can run the analyzer to compute the confusion matrix (-a). You can also store the predictions in a text file or a directory of text files (-o). Passing the -mr parameter will use Hadoop to distribute the classification.</P>
+
+<UL>
+	<LI>The example should output the classification time and the confusion matrix</LI>
+</UL>
+
+
+<DIV class="code panel" style="border-width: 1px;"><DIV class="codeContent panelContent">
+<PRE class="code-java">
+10/03/13 18:08:56 INFO mapreduce.TestForest: Classification Time: 0h 0m 6s 355
+10/03/13 18:08:56 INFO mapreduce.TestForest: =======================================================
+Summary
+-------------------------------------------------------
+Correctly Classified Instances          :      17657	   78.3224%
+Incorrectly Classified Instances        :       4887	   21.6776%
+Total Classified Instances              :      22544
+
+=======================================================
+Confusion Matrix
+-------------------------------------------------------
+a    	b    	&lt;--Classified as
+9459 	252  	 |  9711  	a     = normal
+4635 	8198 	 |  12833 	b     = anomaly
+Default Category: unknown: 2
+</PRE>
+</DIV></DIV>
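+<P>The summary percentages follow directly from the confusion matrix: the diagonal entries are the correctly classified instances. The short shell sketch below recomputes them from the four cell values shown above; the variable names are only illustrative.</P>

```shell
# Values copied from the confusion matrix above.
normal_as_normal=9459
normal_as_anomaly=252
anomaly_as_normal=4635
anomaly_as_anomaly=8198

correct=$(( normal_as_normal + anomaly_as_anomaly ))
incorrect=$(( normal_as_anomaly + anomaly_as_normal ))
total=$(( correct + incorrect ))

# Shell arithmetic is integer-only, so awk computes the percentage.
accuracy=$(LC_ALL=C awk -v c="$correct" -v t="$total" 'BEGIN { printf "%.4f", 100 * c / t }')

echo "correct=$correct incorrect=$incorrect total=$total accuracy=$accuracy%"
```

+<P>The result agrees with the 17657 correctly classified instances out of 22544 (78.3224%) reported in the summary.</P>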
+
+<P>If the input is a single file, then the output will be a single text file; in the above example, 'predictions' would be one single file. If the input is a directory containing, for example, two files 'a.data' and 'b.data', then the output will be a directory 'predictions' containing two files 'a.data.out' and 'b.data.out'.</P>
+
+<H2><A name="PartialImplementation-KnownIssuesandlimitations"></A>Known Issues and limitations</H2>
+<P>The &quot;Decision Forest&quot; code is still &quot;a work in progress&quot;, and many features are still missing. Here is a list of some known issues:</P>
+<UL>
+	<LI>For now, the training does not support multiple input files. The input dataset must be one single file. Classifying new data does support multiple input files.</LI>
+	<LI>The tree building is done when each mapper.close() method is called. Because the mappers don't refresh their state, the job can fail when the dataset is big and you try to build a large number of trees.</LI>
+</UL>
+
+        </DIV>
+
+        
+      </DIV>
+    </DIV>
+    <DIV class="footer">
+      Generated by
+      <A href="http://www.atlassian.com/confluence/">Atlassian Confluence</A> (Version: 3.2 Build: 1810 Mar 16, 2010)
+      <A href="http://could.it/autoexport/">Auto Export Plugin</A> (Version: 1.0.0-dkulp)
+    </DIV>
+<SCRIPT type="text/javascript">
+
+  var _gaq = _gaq || [];
+  _gaq.push(['_setAccount', 'UA-17359171-1']);
+  _gaq.push(['_setDomainName', 'none']);
+  _gaq.push(['_setAllowLinker', true]);
+  _gaq.push(['_trackPageview']);
+
+  (function() {
+    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+  })();
+
+</SCRIPT>
+  </BODY>
+</HTML>
\ No newline at end of file

Added: mahout/site/new_website/MAHOUT/patch-check-list.html
URL: http://svn.apache.org/viewvc/mahout/site/new_website/MAHOUT/patch-check-list.html?rev=1243022&view=auto
==============================================================================
--- mahout/site/new_website/MAHOUT/patch-check-list.html (added)
+++ mahout/site/new_website/MAHOUT/patch-check-list.html Sat Feb 11 10:22:15 2012
@@ -0,0 +1,164 @@
+
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<HTML>
+  <HEAD>
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/space.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/master.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/wiki-content.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/abs.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/menu.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/menu-ie.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/tables.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/panels.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/master-ie.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/renderer-macros.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/content-types.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/login.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/information-macros.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/layout-macros.css">
+<LINK type="text/css" rel="stylesheet" href="https://cwiki.apache.org/confluence/display/MAHOUT/$stylebase/default-theme.css">
+    <LINK type="text/css" rel="stylesheet" href="resources/space.css">
+    <STYLE type="text/css">
+      .footer {
+        background-image:      url('https://cwiki.apache.org/confluence/images/border/border_bottom.gif');
+        background-repeat:     repeat-x;
+        background-position:   left top;
+        padding-top:           4px;
+        color:                 #666;
+      }
+    </STYLE>
+    <SCRIPT type="text/javascript" language="javascript">
+      var hide = null;
+      var show = null;
+      var children = null;
+
+      function init() {
+        /* Search form initialization */
+        var form = document.forms['search'];
+        if (form != null) {
+          form.elements['domains'].value = location.hostname;
+          form.elements['sitesearch'].value = location.hostname;
+        }
+
+        /* Children initialization */
+        hide = document.getElementById('hide');
+        show = document.getElementById('show');
+        children = document.all != null ?
+                   document.all['children'] :
+                   document.getElementById('children');
+        if (children != null) {
+          children.style.display = 'none';
+          show.style.display = 'inline';
+          hide.style.display = 'none';
+        }
+      }
+
+      function showChildren() {
+        children.style.display = 'block';
+        show.style.display = 'none';
+        hide.style.display = 'inline';
+      }
+
+      function hideChildren() {
+        children.style.display = 'none';
+        show.style.display = 'inline';
+        hide.style.display = 'none';
+      }
+    </SCRIPT>
+    <TITLE>Patch Check List</TITLE>
+  <META http-equiv="Content-Type" content="text/html;charset=UTF-8"></HEAD>
+  <BODY onload="init()">
+    <TABLE border="0" cellpadding="2" cellspacing="0" width="100%">
+      <TR class="topBar">
+        <TD align="left" valign="middle" class="topBarDiv" nowrap="">
+          &nbsp;<A href="mahout-wiki.html" title="Apache Mahout">Apache Mahout</A>&nbsp;&gt;&nbsp;<A href="mahout-wiki.html" title="Mahout Wiki">Mahout Wiki</A>&nbsp;&gt;&nbsp;<A href="issue-tracker.html" title="Issue Tracker">Issue Tracker</A>&nbsp;&gt;&nbsp;<A href="" title="Patch Check List">Patch Check List</A>
+        </TD>
+        <TD align="right" valign="middle" nowrap="">
+          <FORM name="search" action="http://www.google.com/search" method="get">
+            <INPUT type="hidden" name="ie" value="UTF-8">
+            <INPUT type="hidden" name="oe" value="UTF-8">
+            <INPUT type="hidden" name="domains" value="">
+            <INPUT type="hidden" name="sitesearch" value="">
+            <INPUT type="text" name="q" maxlength="255" value="">        
+            <INPUT type="submit" name="btnG" value="Google Search">
+          </FORM>
+        </TD>
+      </TR> 
+    </TABLE>
+
+    <DIV id="PageContent">
+      <DIV class="pageheader" style="padding: 6px 0px 0px 0px;">
+        <!-- We'll enable this once we figure out how to access (and save) the logo resource -->
+        <!--img src="/wiki/images/confluence_logo.gif" style="float: left; margin: 4px 4px 4px 10px;" border="0"-->
+        <DIV style="margin: 0px 10px 0px 10px" class="smalltext">Apache Mahout</DIV>
+        <DIV style="margin: 0px 10px 8px 10px" class="pagetitle">Patch Check List</DIV>
+
+        <DIV class="greynavbar" align="right" style="padding: 2px 10px; margin: 0px;">
+          <A href="https://cwiki.apache.org/confluence/pages/editpage.action?pageId=74909">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/notep_16.gif" height="16" width="16" border="0" align="absmiddle" title="Edit Page"></A>
+            <A href="https://cwiki.apache.org/confluence/pages/editpage.action?pageId=74909">Edit Page</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/browse_space.gif" height="16" width="16" border="0" align="absmiddle" title="Browse Space"></A>
+            <A href="https://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">Browse Space</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=74909">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/add_page_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add Page"></A>
+          <A href="https://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=74909">Add Page</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=74909">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/add_blogentry_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add News"></A>
+          <A href="https://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=74909">Add News</A>
+        </DIV>
+      </DIV>
+
+      <DIV class="pagecontent">
+        <DIV class="wiki-content">
+          <P>So, you want to apply a patch?  Here are tips, traps, etc. for dealing with patches (in no particular order):</P>
+
+<OL>
+	<LI>Get a fresh copy of trunk.  Or at least make sure you are up to date and clean your build area.  For complex patches, it is recommended you deal with a fresh checkout.</LI>
+	<LI>Look at the patch and see where it is applied.  Ideally it is generated from the root, but not everyone does this, especially for contrib areas.</LI>
+	<LI>Apply the patch with: patch &#45;p 0 &#45;i &lt;path to patch&gt;  Add &#45;-dry-run first if you want to see what would happen without touching your checkout.</LI>
+	<LI>Did the author write unit tests?  Are the unit tests worthwhile?</LI>
+	<LI>How are the benchmark results?  contrib/benchmarker may be used to test performance in before/after scenarios.</LI>
+	<LI>Are the licenses correct on newly added files? Has an ASF license been granted?</LI>
+	<LI>Update CHANGES.txt.  Give proper credit to the authors.</LI>
+	<LI>Make sure you update JIRA by assigning the issue to yourself so that others know you are working on it.</LI>
+	<LI>If it is a complex change and you have added to the original author's patch, it is suggested that you create a new patch and attach that to JIRA so that it can be discussed.</LI>
+	<LI>How's the documentation, esp. the javadocs?</LI>
+	<LI>Before committing, make sure you add any new files to SVN.  Just because the patch created them doesn't mean they are under version control.</LI>
+	<LI>Run all unit tests and verify that they all pass.</LI>
+	<LI>Generate javadocs, verify no javadoc errors/warnings were introduced by the patch.</LI>
+	<LI>Put in a meaningful commit message.  Reference the JIRA issue when appropriate.</LI>
+	<LI>Remember to update the issue in JIRA when you have completed it.</LI>
+	<LI>From the top directory, run &quot;ant rat-sources&quot; to make sure all the files have license headers.</LI>
+</OL>
+
+        </DIV>
+
+        
+      </DIV>
+    </DIV>
+    <DIV class="footer">
+      Generated by
+      <A href="http://www.atlassian.com/confluence/">Atlassian Confluence</A> (Version: 3.4.6 Build: 2036 Dec 21, 2010)
+      <A href="http://could.it/autoexport/">Auto Export Plugin</A> (Version: 1.0.0-dkulp)
+    </DIV>
+<SCRIPT type="text/javascript">
+
+  var _gaq = _gaq || [];
+  _gaq.push(['_setAccount', 'UA-17359171-1']);
+  _gaq.push(['_setDomainName', 'none']);
+  _gaq.push(['_setAllowLinker', true]);
+  _gaq.push(['_trackPageview']);
+
+  (function() {
+    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+  })();
+
+</SCRIPT>
+  </BODY>
+</HTML>
\ No newline at end of file

Added: mahout/site/new_website/MAHOUT/patchchecklist.html
URL: http://svn.apache.org/viewvc/mahout/site/new_website/MAHOUT/patchchecklist.html?rev=1243022&view=auto
==============================================================================
--- mahout/site/new_website/MAHOUT/patchchecklist.html (added)
+++ mahout/site/new_website/MAHOUT/patchchecklist.html Sat Feb 11 10:22:15 2012
@@ -0,0 +1,149 @@
+
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<HTML>
+  <HEAD>
+    <LINK type="text/css" rel="stylesheet" href="resources/space.css">
+    <STYLE type="text/css">
+      .footer {
+        background-image:      url('http://cwiki.apache.org/confluence/images/border/border_bottom.gif');
+        background-repeat:     repeat-x;
+        background-position:   left top;
+        padding-top:           4px;
+        color:                 #666;
+      }
+    </STYLE>
+    <SCRIPT type="text/javascript" language="javascript">
+      var hide = null;
+      var show = null;
+      var children = null;
+
+      function init() {
+        /* Search form initialization */
+        var form = document.forms['search'];
+        if (form != null) {
+          form.elements['domains'].value = location.hostname;
+          form.elements['sitesearch'].value = location.hostname;
+        }
+
+        /* Children initialization */
+        hide = document.getElementById('hide');
+        show = document.getElementById('show');
+        children = document.all != null ?
+                   document.all['children'] :
+                   document.getElementById('children');
+        if (children != null) {
+          children.style.display = 'none';
+          show.style.display = 'inline';
+          hide.style.display = 'none';
+        }
+      }
+
+      function showChildren() {
+        children.style.display = 'block';
+        show.style.display = 'none';
+        hide.style.display = 'inline';
+      }
+
+      function hideChildren() {
+        children.style.display = 'none';
+        show.style.display = 'inline';
+        hide.style.display = 'none';
+      }
+    </SCRIPT>
+    <TITLE>PatchCheckList</TITLE>
+  <META http-equiv="Content-Type" content="text/html;charset=UTF-8"></HEAD>
+  <BODY onload="init()">
+    <TABLE border="0" cellpadding="2" cellspacing="0" width="100%">
+      <TR class="topBar">
+        <TD align="left" valign="middle" class="topBarDiv" nowrap="">
+          &nbsp;<A href="index.html" title="Apache Lucene Mahout">Apache Lucene Mahout</A>&nbsp;&gt;&nbsp;<A href="index.html" title="index">index</A>&nbsp;&gt;&nbsp;<A href="" title="PatchCheckList">PatchCheckList</A>
+        </TD>
+        <TD align="right" valign="middle" nowrap="">
+          <FORM name="search" action="http://www.google.com/search" method="get">
+            <INPUT type="hidden" name="ie" value="UTF-8">
+            <INPUT type="hidden" name="oe" value="UTF-8">
+            <INPUT type="hidden" name="domains" value="">
+            <INPUT type="hidden" name="sitesearch" value="">
+            <INPUT type="text" name="q" maxlength="255" value="">        
+            <INPUT type="submit" name="btnG" value="Google Search">
+          </FORM>
+        </TD>
+      </TR> 
+    </TABLE>
+
+    <DIV id="PageContent">
+      <DIV class="pageheader" style="padding: 6px 0px 0px 0px;">
+        <!-- We'll enable this once we figure out how to access (and save) the logo resource -->
+        <!--img src="/wiki/images/confluence_logo.gif" style="float: left; margin: 4px 4px 4px 10px;" border="0"-->
+        <DIV style="margin: 0px 10px 0px 10px" class="smalltext">Apache Lucene Mahout</DIV>
+        <DIV style="margin: 0px 10px 8px 10px" class="pagetitle">PatchCheckList</DIV>
+
+        <DIV class="greynavbar" align="right" style="padding: 2px 10px; margin: 0px;">
+          <A href="http://cwiki.apache.org/confluence/pages/editpage.action?pageId=74909">
+            <IMG src="http://cwiki.apache.org/confluence/images/icons/notep_16.gif" height="16" width="16" border="0" align="absmiddle" title="Edit Page"></A>
+            <A href="http://cwiki.apache.org/confluence/pages/editpage.action?pageId=74909">Edit Page</A>
+          &nbsp;
+          <A href="http://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">
+            <IMG src="http://cwiki.apache.org/confluence/images/icons/browse_space.gif" height="16" width="16" border="0" align="absmiddle" title="Browse Space"></A>
+            <A href="http://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">Browse Space</A>
+          &nbsp;
+          <A href="http://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=74909">
+            <IMG src="http://cwiki.apache.org/confluence/images/icons/add_page_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add Page"></A>
+          <A href="http://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=74909">Add Page</A>
+          &nbsp;
+          <A href="http://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=74909">
+            <IMG src="http://cwiki.apache.org/confluence/images/icons/add_blogentry_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add News"></A>
+          <A href="http://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=74909">Add News</A>
+        </DIV>
+      </DIV>
+      <DIV class="pagesubheading" style="margin: 0px 10px 0px 10px;">
+                    Added by <A href="../~gsingers/index.html">Grant Ingersoll</A>, last edited by <A href="../~gsingers/index.html">Grant Ingersoll</A> on Sep 22, 2008
+                      &nbsp;(<A class="noprint" href="http://cwiki.apache.org/confluence/pages/diffpages.action?pageId=74909&originalId=97846">view change</A>)
+              
+      </DIV>
+
+      <DIV class="pagecontent">
+        <DIV class="wiki-content">
+          <H1><A name="PatchCheckList-PatchCheckList"></A>Patch Check List</H1>
+
+<P>So, you want to apply a patch?  Here are tips, traps, etc. for dealing with patches (in no particular order):</P>
+
+<OL>
+	<LI>Get a fresh copy of trunk.  Or at least make sure you are up to date and clean your build area.  For complex patches, it is recommended you deal with a fresh checkout.</LI>
+	<LI>Look at the patch and see where it is applied.  Ideally it is generated from the root, but not everyone does this, especially for contrib areas.</LI>
+	<LI>Apply the patch with: patch &#45;p 0 &#45;i &lt;path to patch&gt;  Add &#45;-dry-run first if you want to see what would happen without touching your checkout.</LI>
+	<LI>Did the author write unit tests?  Are the unit tests worthwhile?</LI>
+	<LI>How are the benchmark results?  contrib/benchmarker may be used to test performance in before/after scenarios.</LI>
+	<LI>Are the licenses correct on newly added files? Has an ASF license been granted?</LI>
+	<LI>Update CHANGES.txt.  Give proper credit to the authors.</LI>
+	<LI>Make sure you update JIRA by assigning the issue to yourself so that others know you are working on it.</LI>
+	<LI>If it is a complex change and you have added to the original author's patch, it is suggested that you create a new patch and attach that to JIRA so that it can be discussed.</LI>
+	<LI>How's the documentation, esp. the javadocs?</LI>
+	<LI>Before committing, make sure you add any new files to SVN.  Just because the patch created them doesn't mean they are under version control.</LI>
+	<LI>Run all unit tests and verify that they all pass.</LI>
+	<LI>Generate javadocs, verify no javadoc errors/warnings were introduced by the patch.</LI>
+	<LI>Put in a meaningful commit message.  Reference the JIRA issue when appropriate.</LI>
+	<LI>Remember to update the issue in JIRA when you have completed it.</LI>
+	<LI>From the top directory, run &quot;ant rat-sources&quot; to make sure all the files have license headers.</LI>
+</OL>
+
+        </DIV>
+
+        
+      </DIV>
+    </DIV>
+    <DIV class="footer">
+      Generated by
+      <A href="http://www.atlassian.com/confluence/">Atlassian Confluence</A> (Version: 2.10.4 Build: 1520 Jul 24, 2009)
+      <A href="http://could.it/autoexport/">Auto Export Plugin</A> (Version: 1.0.0.beta1)
+    </DIV>
+  </BODY>
+</HTML>

Added: mahout/site/new_website/MAHOUT/pearsoncorrelation.html
URL: http://svn.apache.org/viewvc/mahout/site/new_website/MAHOUT/pearsoncorrelation.html?rev=1243022&view=auto
==============================================================================
--- mahout/site/new_website/MAHOUT/pearsoncorrelation.html (added)
+++ mahout/site/new_website/MAHOUT/pearsoncorrelation.html Sat Feb 11 10:22:15 2012
@@ -0,0 +1,142 @@
+
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<HTML>
+  <HEAD>
+    <LINK type="text/css" rel="stylesheet" href="resources/space.css">
+    <STYLE type="text/css">
+      .footer {
+        background-image:      url('https://cwiki.apache.org/confluence/images/border/border_bottom.gif');
+        background-repeat:     repeat-x;
+        background-position:   left top;
+        padding-top:           4px;
+        color:                 #666;
+      }
+    </STYLE>
+    <SCRIPT type="text/javascript" language="javascript">
+      var hide = null;
+      var show = null;
+      var children = null;
+
+      function init() {
+        /* Search form initialization */
+        var form = document.forms['search'];
+        if (form != null) {
+          form.elements['domains'].value = location.hostname;
+          form.elements['sitesearch'].value = location.hostname;
+        }
+
+        /* Children initialization */
+        hide = document.getElementById('hide');
+        show = document.getElementById('show');
+        children = document.all != null ?
+                   document.all['children'] :
+                   document.getElementById('children');
+        if (children != null) {
+          children.style.display = 'none';
+          show.style.display = 'inline';
+          hide.style.display = 'none';
+        }
+      }
+
+      function showChildren() {
+        children.style.display = 'block';
+        show.style.display = 'none';
+        hide.style.display = 'inline';
+      }
+
+      function hideChildren() {
+        children.style.display = 'none';
+        show.style.display = 'inline';
+        hide.style.display = 'none';
+      }
+    </SCRIPT>
+    <TITLE>PearsonCorrelation</TITLE>
+  <META http-equiv="Content-Type" content="text/html;charset=UTF-8"></HEAD>
+  <BODY onload="init()">
+    <TABLE border="0" cellpadding="2" cellspacing="0" width="100%">
+      <TR class="topBar">
+        <TD align="left" valign="middle" class="topBarDiv" nowrap="">
+          &nbsp;<A href="mahout-wiki.html" title="Apache Mahout">Apache Mahout</A>&nbsp;&gt;&nbsp;<A href="mahout-wiki.html" title="Mahout Wiki">Mahout Wiki</A>&nbsp;&gt;&nbsp;<A href="glossary.html" title="Glossary">Glossary</A>&nbsp;&gt;&nbsp;<A href="" title="PearsonCorrelation">PearsonCorrelation</A>
+        </TD>
+        <TD align="right" valign="middle" nowrap="">
+          <FORM name="search" action="http://www.google.com/search" method="get">
+            <INPUT type="hidden" name="ie" value="UTF-8">
+            <INPUT type="hidden" name="oe" value="UTF-8">
+            <INPUT type="hidden" name="domains" value="">
+            <INPUT type="hidden" name="sitesearch" value="">
+            <INPUT type="text" name="q" maxlength="255" value="">        
+            <INPUT type="submit" name="btnG" value="Google Search">
+          </FORM>
+        </TD>
+      </TR> 
+    </TABLE>
+
+    <DIV id="PageContent">
+      <DIV class="pageheader" style="padding: 6px 0px 0px 0px;">
+        <!-- We'll enable this once we figure out how to access (and save) the logo resource -->
+        <!--img src="/wiki/images/confluence_logo.gif" style="float: left; margin: 4px 4px 4px 10px;" border="0"-->
+        <DIV style="margin: 0px 10px 0px 10px" class="smalltext">Apache Mahout</DIV>
+        <DIV style="margin: 0px 10px 8px 10px" class="pagetitle">PearsonCorrelation</DIV>
+
+        <DIV class="greynavbar" align="right" style="padding: 2px 10px; margin: 0px;">
+          <A href="https://cwiki.apache.org/confluence/pages/editpage.action?pageId=10846333">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/notep_16.gif" height="16" width="16" border="0" align="absmiddle" title="Edit Page"></A>
+            <A href="https://cwiki.apache.org/confluence/pages/editpage.action?pageId=10846333">Edit Page</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/browse_space.gif" height="16" width="16" border="0" align="absmiddle" title="Browse Space"></A>
+            <A href="https://cwiki.apache.org/confluence/pages/listpages.action?key=MAHOUT">Browse Space</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=10846333">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/add_page_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add Page"></A>
+          <A href="https://cwiki.apache.org/confluence/pages/createpage.action?spaceKey=MAHOUT&fromPageId=10846333">Add Page</A>
+          &nbsp;
+          <A href="https://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=10846333">
+            <IMG src="https://cwiki.apache.org/confluence/images/icons/add_blogentry_16.gif" height="16" width="16" border="0" align="absmiddle" title="Add News"></A>
+          <A href="https://cwiki.apache.org/confluence/pages/createblogpost.action?spaceKey=MAHOUT&fromPageId=10846333">Add News</A>
+        </DIV>
+      </DIV>
+      <DIV class="pagesubheading" style="margin: 0px 10px 0px 10px;">
+      </DIV>
+
+      <DIV class="pagecontent">
+        <DIV class="wiki-content">
+          <P>The Pearson correlation measures the degree to which two series of numbers tend to move together &ndash; values in corresponding positions tend to be high together, or low together. In particular it measures the strength of the linear relationship between the two series, the degree to which one can be estimated as a linear function of the other. It is often used in collaborative filtering as a similarity metric on users or items; users that tend to rate the same items high, or low, have a high Pearson correlation and therefore are &quot;similar&quot;.</P>
+
+<P>The Pearson correlation can behave very badly when small counts are involved.  For example, any two sequences of length two, each containing two distinct values, have a correlation of exactly 1 or &minus;1, no matter how different the values are.  To some degree, this problem can be avoided by not computing correlations for short sequences (with fewer than, say, 10 values).</P>
+
+<P>Pearson correlation is sometimes used in collaborative filtering to define similarity between the ratings of two users on a common set of items.  In this application, it is a reasonable measure if there is sufficient overlap.  Unfortunately, it cannot take into account how large that overlap is relative to all the ratings each user has made.</P>
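+
+<P>As a rough illustration of the small-count problem, the sketch below computes the correlation of two users' ratings over their co-rated items in plain Python.  The function name <TT>pearson</TT> and the sample ratings are made up for this example; this is not Mahout's implementation, only the textbook formula (covariance divided by the product of the standard deviations).</P>

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series, or None if undefined."""
    if len(xs) != len(ys) or len(xs) < 2:
        return None
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and centered squares; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings: two users who share only two items.
print(pearson([1, 5], [4, 5]))  # 1.0, although the ratings are quite different
print(pearson([1, 5], [5, 4]))  # -1.0

# A longer overlap gives a less extreme, more informative value.
print(pearson([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```

+<P>With only two co-rated items the result can only be 1, &minus;1 or undefined, which is why a minimum overlap is usually required before the value is trusted.</P>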
+
+<P>See Also</P>
+<UL>
+	<LI><A href="http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient" class="external-link" rel="nofollow">http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient</A></LI>
+</UL>
+
+        </DIV>
+
+        
+      </DIV>
+    </DIV>
+    <DIV class="footer">
+      Generated by
+      <A href="http://www.atlassian.com/confluence/">Atlassian Confluence</A> (Version: 3.2 Build: 1810 Mar 16, 2010)
+      <A href="http://could.it/autoexport/">Auto Export Plugin</A> (Version: 1.0.0-dkulp)
+    </DIV>
+<SCRIPT type="text/javascript">
+
+  var _gaq = _gaq || [];
+  _gaq.push(['_setAccount', 'UA-17359171-1']);
+  _gaq.push(['_setDomainName', 'none']);
+  _gaq.push(['_setAllowLinker', true]);
+  _gaq.push(['_trackPageview']);
+
+  (function() {
+    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+  })();
+
+</SCRIPT>
+  </BODY>
+</HTML>
\ No newline at end of file