Posted to dev@mahout.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2010/01/07 08:31:54 UTC

[jira] Issue Comment Edited: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797544#action_12797544 ] 

Deneche A. Hakim edited comment on MAHOUT-216 at 1/7/10 7:30 AM:
-----------------------------------------------------------------

Here are some results on a 5-slave EC2 cluster, using 100% of the KDD dataset:

 || Num Map Tasks || Num Trees || Build Time || oob error ||
 | 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
 | 10 | 100 | 0h 10m 5s 231 | 1.2E-4 |

The results look good; next I'll try the generated classifier on the KDD test data and see...

Some known issues (that I'll try to fix) are:
* the MapReduce implementations cannot handle datasets split across multiple files
* because a lot of work is done when the mappers are closing, I need to refresh some Hadoop counter; otherwise the job is canceled when trying to build a lot of trees (400)
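
To illustrate the second issue: Hadoop kills a task that reports no progress within the task timeout, so a long-running close phase has to signal liveness while it works. The following is only a sketch of that idea, assuming the cancellation is caused by that timeout. `ProgressCallback` and `TreeBuildingCloser` are hypothetical names, not Mahout or Hadoop types; in a real mapper the callback would wrap whatever progress or counter call the Hadoop API in use provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical stand-in for a Hadoop progress hook; not a Hadoop or Mahout type. */
interface ProgressCallback {
  void progress();
}

/** Hypothetical helper showing where liveness signals go during a long close phase. */
final class TreeBuildingCloser {

  private TreeBuildingCloser() {}

  /**
   * Builds numTrees trees one after another and signals liveness after each one,
   * so the framework does not see the task as stalled.
   */
  static <T> List<T> buildAll(int numTrees, IntFunction<T> treeBuilder, ProgressCallback callback) {
    List<T> trees = new ArrayList<>(numTrees);
    for (int i = 0; i < numTrees; i++) {
      trees.add(treeBuilder.apply(i)); // potentially long-running work
      callback.progress();             // report that the task is still alive
    }
    return trees;
  }
}
```

Reporting once per tree keeps the interval between signals bounded by the time needed to build a single tree, whatever the total number of trees.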


      was (Author: adeneche):
    Here are some results on a 5-slave EC2 cluster, using 100% of the KDD dataset:

 || Num Map Tasks || Num Trees || Build Time || oob error ||
 | 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
 | 10 | 100 | 0h 10m 5s 231 | 1.7E-4 |

The results look good; next I'll try the generated classifier on the KDD test data and see...

Some known issues (that I'll try to fix) are:
* the MapReduce implementations cannot handle datasets split across multiple files
* because a lot of work is done when the mappers are closing, I need to refresh some Hadoop counter; otherwise the job is canceled when trying to build a lot of trees (400)

  
> Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-216
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-216
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: 0.3
>
>
> The poor results of the partial decision forest implementation may be explained by the particular distribution of the partitioned data. For example, if a partition does not contain any instance of a given class, the decision trees built using this partition won't be able to classify this class.
> According to [CHAN, 95]:
> {quote}
> Random Selection of the partitioned data sets with a uniform distribution of classes is perhaps the most sensible solution. Here we may attempt to maintain the same frequency distribution over the "class attribute" so that each partition represents a good but a smaller model of the entire training set
> {quote}
> [CHAN, 95]: Philip K. Chan, "On the Accuracy of Meta-learning for Scalable Data Mining" 
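
The partitioning described in the issue can be sketched as follows. This is a minimal in-memory illustration, not the MAHOUT-216 implementation: `StratifiedPartitioner` and its `assign` method are hypothetical names, and the approach (shuffle each class, then deal its instances round-robin over the partitions) is just one simple way to keep the class frequencies of every partition close to those of the full training set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical helper, not part of Mahout: assigns instances to class-balanced partitions. */
final class StratifiedPartitioner {

  private StratifiedPartitioner() {}

  /**
   * Returns an array where element i is the partition assigned to instance i.
   * For every class, the per-partition instance counts differ by at most one.
   *
   * @param labels        labels[i] is the class label of instance i
   * @param numPartitions number of partitions (for example, the number of map tasks)
   * @param seed          seed for the random selection within each class
   */
  static int[] assign(int[] labels, int numPartitions, long seed) {
    if (numPartitions <= 0) {
      throw new IllegalArgumentException("numPartitions must be positive");
    }
    // Group instance indices by class label (TreeMap gives a deterministic class order).
    Map<Integer, List<Integer>> byClass = new TreeMap<>();
    for (int i = 0; i < labels.length; i++) {
      byClass.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);
    }
    Random rng = new Random(seed);
    int[] partition = new int[labels.length];
    // The round-robin position carries over between classes, so the leftovers
    // of small classes do not all pile up in partition 0.
    int next = 0;
    for (List<Integer> indices : byClass.values()) {
      Collections.shuffle(indices, rng); // random selection within the class
      for (int idx : indices) {
        partition[idx] = next;
        next = (next + 1) % numPartitions;
      }
    }
    return partition;
  }
}
```

With this scheme a partition can only lack a class entirely when that class has fewer instances than there are partitions, which is the failure case the issue description points at.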
