Posted to commits@mahout.apache.org by co...@apache.org on 2010/01/09 12:24:00 UTC

[CONF] Apache Lucene Mahout > Partial Implementation

Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: Partial Implementation (http://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation)

Added by abdelHakim Deneche:
---------------------------------------------------------------------
h1. Introduction

This quick start page shows how to build a decision forest using the partial implementation. This is a MapReduce implementation where each mapper builds a subset of the forest using only the data available in its partition. This allows building forests from large datasets, as long as each partition can be loaded in memory.

h1. Steps
h2. Download the data
* The current implementation is compatible with the UCI repository file format. In this example we'll use the KDD'99 dataset because it's large enough to show the performance of the partial implementation.
You can download the dataset here: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
If you are using a cluster you can download the full dataset "kddcup.data.gz" (18M; 743M uncompressed), but if you are running Hadoop on a single machine you should use the 10% subset "kddcup.data_10_percent.gz" (2.1M; 75M uncompressed).
* Unzip the dataset
* Put the data in HDFS: {code}$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata{code} The commands that follow assume the data file is available in HDFS as testdata/kddcup-10p.data, so rename the downloaded file accordingly before uploading it (a combined sketch of these steps follows this list).
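A minimal end-to-end sketch of this download/upload step. The exact download URL is an assumption (check the page linked above); the local/HDFS file name kddcup-10p.data is only chosen to match the commands used later on this page:
{code}
# download the 10% subset of the KDD'99 dataset (URL assumed, see the UCI page above)
wget http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz

# uncompress it and give it the name used by the rest of this page
gunzip kddcup.data_10_percent.gz
mv kddcup.data_10_percent kddcup-10p.data

# upload it to HDFS under testdata/
$HADOOP_HOME/bin/hadoop fs -mkdir testdata
$HADOOP_HOME/bin/hadoop fs -put kddcup-10p.data testdata/kddcup-10p.data
{code}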

h2. Build the Job files
* In $MAHOUT_HOME/ run: {code}mvn install -DskipTests{code}
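The build produces the .job files used by the commands below; a quick sanity check (a sketch, assuming the standard layout of the Mahout source tree):
{code}
# the core and examples job files referenced by the commands on this page
ls $MAHOUT_HOME/core/target/mahout-core-*.job
ls $MAHOUT_HOME/examples/target/mahout-examples-*.job
{code}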

h2. Generate a file descriptor for the dataset
For the KDD dataset (kddcup-10p.data), run:
{code}
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<VERSION>.job org.apache.mahout.df.tools.Describe -p testdata/kddcup-10p.data -f testdata/kddcup-10p.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
{code}
The "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" string indicates the nature of the variables. which means 1 numerical(N) attribute, followed by 3 numerical(Categorical) attributes, ...L indicates the label, and you can use I to ignore some attributes

h2. Run the example

For now there are two implementations: "mapred", which works with the old Hadoop API (pre 0.20), and "mapreduce", which works with the new Hadoop API (0.20). Please note that future work will go into the "mapreduce" implementation only, and the "mapred" implementation should be removed in the future.

{code}
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<VERSION>.job org.apache.mahout.df.mapreduce.BuildForest -Dmapred.max.split.size=7488975 -oob -d testdata/kddcup-10p.data -ds testdata/kddcup-10p.info -sl 1 -p -t 100
{code}
This builds 100 trees (-t argument) using the partial implementation (-p). Each tree is built using one randomly selected attribute per node (-sl argument), and the example computes the out-of-bag error (-oob).
The number of partitions is controlled by the -Dmapred.max.split.size argument, which tells Hadoop the maximum size of each partition, in this case 1/10 of the size of the dataset, so the forest is built over 10 partitions.
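If you want a different number of partitions, choose the split size as roughly (dataset size in bytes / number of partitions). A minimal sketch of that computation for this example (the command only checks the size; the division is done by hand):
{code}
# size of the dataset in HDFS, in bytes
$HADOOP_HOME/bin/hadoop fs -du testdata/kddcup-10p.data

# the example's 7488975 is one tenth of the dataset's size (~75M), giving 10 partitions
#   => -Dmapred.max.split.size=7488975
{code}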
* The example outputs the Build Time and the oob error estimate:

{code}
10/01/09 11:18:52 INFO mapreduce.BuildForest: Build Time: 0h 7m 48s 175
10/01/09 11:19:07 INFO mapreduce.BuildForest: oob error estimate : 0.051475544561870853
{code}

h2. Improving the results by redistributing the dataset tuples among the partitions
Because each tree is built using only the data available in one partition, the tuples need to be well distributed among the partitions.
Let's start by computing the distribution of the labels among the partitions; run the following:

{code}
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<VERSION>.job org.apache.mahout.df.tools.Frequencies -Dmapred.max.split.size=7488975 -d testdata/kddcup-10p.data -ds testdata/kddcup-10p.info
{code}
* The tool outputs the number of tuples of each label for each partition:
{code}
10/01/09 11:32:54 INFO tools.Frequencies: [37966, 3, 1, 1, 2, 10267, 53, 20, 99, 40, 197, 1, 6, 1000, 1, 2, 1, 104, 0, 0, 0, 0, 0]
10/01/09 11:32:54 INFO tools.Frequencies: [18271, 2, 1, 1, 20480, 8412, 0, 20, 100, 238, 563, 16, 2, 1002, 11, 537, 2, 127, 6, 20, 0, 0, 0]
10/01/09 11:32:54 INFO tools.Frequencies: [14818, 7, 6, 0, 20640, 11739, 0, 62, 173, 361, 102, 0, 0, 101, 0, 668, 0, 0, 1, 0, 1020, 2, 7]
10/01/09 11:32:54 INFO tools.Frequencies: [170, 0, 0, 0, 0, 48395, 0, 0, 25, 0, 0, 0, 0, 0, 0, 380, 0, 0, 0, 0, 0, 0, 0]
10/01/09 11:32:54 INFO tools.Frequencies: [0, 0, 0, 0, 0, 48948, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
10/01/09 11:32:54 INFO tools.Frequencies: [0, 0, 0, 0, 0, 48947, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
10/01/09 11:32:54 INFO tools.Frequencies: [1945, 2, 0, 1, 0, 46837, 0, 40, 0, 102, 74, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
10/01/09 11:32:54 INFO tools.Frequencies: [4840, 3, 0, 0, 43386, 819, 0, 100, 482, 118, 182, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
10/01/09 11:32:54 INFO tools.Frequencies: [406, 0, 1, 0, 1523, 47362, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
10/01/09 11:32:54 INFO tools.Frequencies: [18862, 13, 0, 0, 21170, 9064, 0, 22, 100, 180, 129, 3, 0, 100, 0, 1, 1, 0, 0, 0, 0, 0, 2]
{code}
Each line represents a partition and each column a different label. You can see that the labels are not well distributed; for example, the fifth and sixth partitions don't contain any tuple of the first label, which means that any tree built using those partitions won't be able to classify tuples from the first label.
* To redistribute the tuples, we'll run the UDistrib tool:
{code}
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<VERSION>.job org.apache.mahout.df.tools.UDistrib -d testdata/kddcup-10p.data -ds testdata/kddcup-10p.info -o testdata/ukddcup-10p.data -p 10
{code}
This tells the tool to redistribute the dataset kddcup-10p.data (-d argument), using its corresponding descriptor file (-ds argument), into 10 partitions (-p argument) and to generate a new dataset (-o argument). You can use the same descriptor file for the new dataset as well.
Computing the frequencies on the new dataset gives the following results:
{code}
10/01/09 11:49:35 INFO tools.Frequencies: [9423, 3, 1, 0, 10720, 28070, 5, 26, 88, 104, 125, 2, 1, 220, 1, 159, 1, 23, 1, 2, 102, 0, 1]
10/01/09 11:49:35 INFO tools.Frequencies: [9727, 3, 1, 0, 10721, 27758, 5, 26, 98, 104, 125, 2, 1, 221, 1, 159, 1, 24, 1, 2, 102, 0, 1]
10/01/09 11:49:35 INFO tools.Frequencies: [9510, 2, 0, 0, 10657, 28044, 5, 26, 98, 104, 117, 2, 1, 221, 1, 159, 0, 23, 1, 2, 102, 0, 1]
10/01/09 11:49:35 INFO tools.Frequencies: [9396, 4, 1, 0, 10720, 28079, 5, 26, 98, 104, 123, 2, 1, 221, 1, 159, 0, 23, 1, 2, 102, 1, 1]
10/01/09 11:49:35 INFO tools.Frequencies: [9676, 2, 1, 0, 10461, 28079, 6, 26, 98, 104, 119, 2, 1, 210, 1, 159, 0, 23, 1, 2, 102, 1, 1]
10/01/09 11:49:35 INFO tools.Frequencies: [9728, 3, 1, 1, 10394, 28079, 6, 26, 98, 104, 125, 2, 1, 220, 2, 159, 0, 23, 0, 2, 102, 0, 1]
10/01/09 11:49:35 INFO tools.Frequencies: [9727, 3, 1, 1, 10395, 28079, 6, 27, 98, 104, 125, 2, 0, 220, 2, 159, 0, 23, 0, 2, 102, 0, 1]
10/01/09 11:49:35 INFO tools.Frequencies: [9728, 3, 1, 1, 10399, 28079, 5, 27, 97, 104, 125, 2, 0, 220, 1, 158, 0, 23, 0, 2, 102, 0, 1]
10/01/09 11:49:35 INFO tools.Frequencies: [9727, 3, 1, 0, 10398, 28079, 5, 27, 98, 104, 125, 1, 1, 220, 1, 159, 1, 23, 1, 2, 102, 0, 1]
10/01/09 11:49:35 INFO tools.Frequencies: [10636, 4, 1, 0, 12336, 28444, 5, 27, 108, 104, 138, 4, 1, 230, 1, 159, 1, 23, 1, 2, 102, 0, 1]
{code}
The dataset seems better distributed, and sure enough, building the decision forest using this new dataset gives a much better oob estimate (6.4E-4).
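To reproduce that figure, re-run the BuildForest example exactly as before, only pointing -d at the redistributed file (a sketch; all other arguments are unchanged from the earlier run):
{code}
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<VERSION>.job org.apache.mahout.df.mapreduce.BuildForest -Dmapred.max.split.size=7488975 -oob -d testdata/ukddcup-10p.data -ds testdata/kddcup-10p.info -sl 1 -p -t 100
{code}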

h2. Known Issues and limitations
The "Decision Forest" code is still "a work in progress", many features are still missings. Here is a list of some known issues:
* The input dataset must be a single file. Multiple input files are not, yet, supported
* The tree building is done when each mapper.close() method is called. Because the mappers don't refresh their state, the job can fail when the dataset is big and you try to build a large number of trees.
* When dealing with Categorical attributes, the current implemenation can generate decision trees with unnecessary nodes, although the error rate is not affected
* For now you can only build decision trees, but you can't, yet, use them to classify unkown data :(
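Regarding the second issue above: if the failures you see are task timeouts (because the mappers are busy building trees inside close() and stop reporting progress), one possible workaround, not from the original page and only an assumption, is to raise Hadoop's task timeout for that job:
{code}
# raise the task timeout to 1 hour (default is 10 minutes); the value is in milliseconds
# and mapred.task.timeout is the pre-0.21 property name -- adjust to your Hadoop version
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<VERSION>.job org.apache.mahout.df.mapreduce.BuildForest -Dmapred.task.timeout=3600000 -Dmapred.max.split.size=7488975 -oob -d testdata/kddcup-10p.data -ds testdata/kddcup-10p.info -sl 1 -p -t 100
{code}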
