Posted to dev@mahout.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2010/01/07 08:31:54 UTC
[jira] Issue Comment Edited: (MAHOUT-216) Improve the results of
MAHOUT-145 by uniformly distributing the classes in the partitioned data
[ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797544#action_12797544 ]
Deneche A. Hakim edited comment on MAHOUT-216 at 1/7/10 7:30 AM:
-----------------------------------------------------------------
Here are some results on a 5-slave EC2 cluster, using the full (100%) KDD dataset:
|| Num Map Tasks || Num Trees || Build Time || oob error ||
| 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
| 10 | 100 | 0h 10m 5s 231 | 1.2E-4 |
The results look good; now I'll have to try the generated classifier on the KDD test data and see...
Some known issues (that I'll try to fix) are:
* the mapreduce implementations cannot handle datasets split across multiple files
* because a lot of work is done while the mappers are closing, I need to refresh some Hadoop counter, otherwise the job is canceled when trying to build a lot of trees (400)
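A minimal sketch of the workaround for the second issue, under stated assumptions: the `Progress` interface is a hypothetical stand-in for the hook Hadoop gives a task (for example `Reporter.progress()`), and `buildOneTree` is a placeholder, not the actual Mahout code. The idea is only that the long close phase signals the framework after each tree, so the task is not treated as hung.

```java
/** Hypothetical sketch, not Mahout or Hadoop code. */
final class CloseTimeHeartbeat {

  /** Stand-in for the progress hook that Hadoop hands to a task. */
  interface Progress {
    void progress();
  }

  private CloseTimeHeartbeat() { }

  /**
   * Builds numTrees trees during the close phase, pinging the framework
   * after each one so the task is not considered hung.
   *
   * @return the number of trees built
   */
  static int buildTreesOnClose(int numTrees, Progress reporter) {
    int built = 0;
    for (int tree = 0; tree < numTrees; tree++) {
      buildOneTree(tree);
      reporter.progress(); // heartbeat: tells the framework the task is alive
      built++;
    }
    return built;
  }

  /** Placeholder for the real (slow) tree-building step. */
  private static void buildOneTree(int treeId) {
    // intentionally empty in this sketch
  }
}
```

In a real mapper the reporter would have to be kept from an earlier call, since the old `Mapper.close()` signature does not receive one.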
was (Author: adeneche):
Here are some results on a 5-slave EC2 cluster, using the full (100%) KDD dataset:
|| Num Map Tasks || Num Trees || Build Time || oob error ||
| 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
| 10 | 100 | 0h 10m 5s 231 | 1.7E-4 |
The results look good; now I'll have to try the generated classifier on the KDD test data and see...
Some known issues (that I'll try to fix) are:
* the mapreduce implementations cannot handle datasets split across multiple files
* because a lot of work is done while the mappers are closing, I need to refresh some Hadoop counter, otherwise the job is canceled when trying to build a lot of trees (400)
> Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data
> -----------------------------------------------------------------------------------------------
>
> Key: MAHOUT-216
> URL: https://issues.apache.org/jira/browse/MAHOUT-216
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Affects Versions: 0.2
> Reporter: Deneche A. Hakim
> Assignee: Deneche A. Hakim
> Fix For: 0.3
>
>
> The poor results of the partial decision forest implementation may be explained by the particular distribution of the partitioned data. For example, if a partition does not contain any instance of a given class, the decision trees built from that partition will never be able to predict that class.
> According to [CHAN, 95]:
> {quote}
> Random Selection of the partitioned data sets with a uniform distribution of classes is perhaps the most sensible solution. Here we may attempt to maintain the same frequency distribution over the "class attribute" so that each partition represents a good but a smaller model of the entire training set.
> {quote}
> [CHAN, 95]: Philip K. Chan, "On the Accuracy of Meta-learning for Scalable Data Mining"
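One way to read the proposal above is as a stratified split. The following is a minimal sketch of that idea, not the patch attached to this issue: the class `StratifiedPartitioner`, its `partition` method, and the use of string labels are all assumptions made for illustration. Instances are grouped by class, shuffled, and dealt round-robin, so each partition receives either the floor or the ceiling of its fair share of every class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical helper, not part of Mahout. */
final class StratifiedPartitioner {

  private StratifiedPartitioner() { }

  /**
   * @param labels class label of each instance, indexed by instance id
   * @param numPartitions number of partitions to create (at least 1)
   * @param seed seed for the shuffle, so the split is reproducible
   * @return for each partition, the ids of the instances assigned to it
   */
  static List<List<Integer>> partition(List<String> labels, int numPartitions, long seed) {
    if (numPartitions < 1) {
      throw new IllegalArgumentException("numPartitions must be >= 1");
    }

    // Group instance ids by class label.
    Map<String, List<Integer>> byClass = new LinkedHashMap<>();
    for (int id = 0; id < labels.size(); id++) {
      byClass.computeIfAbsent(labels.get(id), k -> new ArrayList<>()).add(id);
    }

    List<List<Integer>> partitions = new ArrayList<>();
    for (int p = 0; p < numPartitions; p++) {
      partitions.add(new ArrayList<>());
    }

    Random rng = new Random(seed);
    // The cursor is carried across classes so that leftover instances
    // do not always land in the first partitions.
    int next = 0;
    for (List<Integer> ids : byClass.values()) {
      Collections.shuffle(ids, rng); // random selection within each class
      for (int id : ids) {
        partitions.get(next).add(id);
        next = (next + 1) % numPartitions;
      }
    }
    return partitions;
  }
}
```

With this scheme a class is missing from a partition only when it has fewer instances than there are partitions.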