Posted to dev@mahout.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2009/12/10 20:12:29 UTC

[jira] Created: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data
-----------------------------------------------------------------------------------------------

                 Key: MAHOUT-216
                 URL: https://issues.apache.org/jira/browse/MAHOUT-216
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
            Reporter: Deneche A. Hakim
            Assignee: Deneche A. Hakim


The poor results of the partial decision forest implementation may be explained by the particular distribution of the partitioned data. For example, if a partition does not contain any instance of a given class, the decision trees built using this partition won't be able to classify this class.
According to [CHAN, 95]:

{quote}
Random Selection of the partitioned data sets with a uniform distribution of classes is perhaps the most sensible solution. Here we may attempt to maintain the same frequency distribution over the ''class attribute" so that each partition represents a good but a smaller model of the entire training set
{quote}

[CHAN, 95]: Philip K. Chan, "On the Accuracy of Meta-learning for Scalable Data Mining" 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Issue Comment Edited: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by Ted Dunning <te...@gmail.com>.
Nice work Deneche.

On Thu, Dec 10, 2009 at 12:34 PM, Deneche A. Hakim (JIRA)
<ji...@apache.org>wrote:



-- 
Ted Dunning, CTO
DeepDyve

[jira] Updated: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-216:
-----------------------------

        Fix Version/s: 0.3
    Affects Version/s: 0.2



[jira] Resolved: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Deneche A. Hakim (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deneche A. Hakim resolved MAHOUT-216.
-------------------------------------

    Resolution: Fixed

Done.



[jira] Issue Comment Edited: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Deneche A. Hakim (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797544#action_12797544 ] 

Deneche A. Hakim edited comment on MAHOUT-216 at 1/7/10 7:30 AM:
-----------------------------------------------------------------

Here are some results on a 5-slave EC2 cluster, using 100% of the KDD dataset:

 || Num Map Tasks || Num Trees || Build Time || oob error ||
 | 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
 | 10 | 100 | 0h 10m 5s 231 | 1.2E-4 |

The results look good; now I'll have to try the generated classifier on the KDD test data and see...

Some known issues (that I'll try to fix) are:
* the MapReduce implementations cannot handle datasets split over multiple files
* because a lot of work is done while the mappers are closing, I need to refresh some Hadoop counter, or the job gets canceled when trying to build a lot of trees (400)


      was (Author: adeneche):
Here are some results on a 5-slave EC2 cluster, using 100% of the KDD dataset:

 || Num Map Tasks || Num Trees || Build Time || oob error ||
 | 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
 | 10 | 100 | 0h 10m 5s 231 | 1.7E-4 |

The results look good; now I'll have to try the generated classifier on the KDD test data and see...

Some known issues (that I'll try to fix) are:
* the MapReduce implementations cannot handle datasets split over multiple files
* because a lot of work is done while the mappers are closing, I need to refresh some Hadoop counter, or the job gets canceled when trying to build a lot of trees (400)

  


[jira] Issue Comment Edited: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Deneche A. Hakim (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788902#action_12788902 ] 

Deneche A. Hakim edited comment on MAHOUT-216 at 12/10/09 8:32 PM:
-------------------------------------------------------------------

The next step is to implement a tool that redistributes the instances over the partitions in order to get a uniform distribution of classes.
The tool I implemented uses the following simple algorithm:
{code}
input:
* data
* number of partitions p

create p output files, one for each partition
for each class c, currents[c] = the partition where the next tuple of class c will be put; currents[] is randomly initialized with integers in the range [0, p)

for each tuple do the following:
* put the tuple in the file corresponding to the partition currents[tuple.class]
* currents[tuple.class]++; if (currents[tuple.class] == p) currents[tuple.class] = 0;
end for
{code}
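For illustration, the algorithm above can be sketched in Python (a minimal in-memory sketch: lists stand in for the p output files, and the name `partition_uniformly` is mine, not the actual tool's):

```python
import random

def partition_uniformly(tuples, p, seed=None):
    """Deal each class's instances out round-robin across p partitions,
    so every partition gets a near-uniform share of every class.

    `tuples` is a sequence of (features, label) pairs; the real tool
    writes p output files, here we return p lists instead."""
    rng = random.Random(seed)
    partitions = [[] for _ in range(p)]
    currents = {}  # class label -> partition that receives its next tuple
    for t in tuples:
        label = t[1]
        if label not in currents:
            # randomly initialize in [0, p) so classes start staggered
            currents[label] = rng.randrange(p)
        partitions[currents[label]].append(t)
        currents[label] = (currents[label] + 1) % p
    return partitions
```

Because each class is dealt out round-robin from a random starting partition, its counts across partitions can differ by at most one, which matches the near-equal per-class frequencies reported below.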

Computing the frequency distribution of the class attribute on the "uniform" data gives the following results:
||||label1||label2||label3||...||
|partition 0|98218|	3|	1|	0|	107202|	280789|	5|	27|	98|	1041|	1248|	2|	0|	220|	1|	1590|	0|	231|	1|	2|	102|	0|	1|
|partition 1|98220|	3|	1|	0|	107202|	280789|	5|	27|	98|	1041|	1248|	2|	0|	220|	1|	1589|	0|	231|	1|	2|	102|	1|	1|
|partition 2|98214|	3|	1|	0|	107202|	280789|	5|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	1|	231|	1|	2|	102|	1|	1|
|partition 3|98221|	3|	1|	0|	107201|	280789|	5|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	1|	231|	1|	2|	102|	0|	1|
|partition 4|98221|	3|	1|	0|	107201|	280788|	5|	26|	98|	1042|	1249|	2|	1|	221|	1|	1589|	1|	232|	1|	2|	102|	0|	1|
|partition 5|98218|	4|	1|	1|	107201|	280788|	5|	26|	97|	1042|	1248|	3|	1|	221|	1|	1589|	1|	232|	1|	2|	102|	0|	1|
|partition 6|98220|	3|	1|	0|	107202|	280788|	6|	26|	98|	1042|	1248|	2|	1|	221|	1|	1589|	0|	232|	1|	2|	102|	0|	1|
|partition 7|98219|	2|	1|	1|	107202|	280788|	6|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	0|	232|	0|	2|	102|	0|	1|
|partition 8|97541|	3|	0|	1|	107204|	281453|	6|	27|	98|	1041|	1248|	2|	1|	220|	2|	1589|	0|	232|	0|	2|	102|	0|	1|
|partition 9|89489|	3|	1|	0|	107200|	280125|	5|	27|	98|	1041|	1248|	2|	1|	220|	2|	1590|	0|	232|	0|	2|	102|	0|	1|
 
This time the classes are well distributed. 

Here is a quick comparison between the original dataset and the "uniform" dataset, building 10 trees over 10 partitions with the partial implementation: on the original data the out-of-bag error was 0.2, while on the "uniform" dataset it is 1.7E-4.

Although more tests are needed, these results are very encouraging.

PS:
This program generates p files that should be used as training data instead of the original data. Although my MapReduce partial implementation should handle datasets split over many files, it currently does not; for now I have to join the files into one single file before running the partial implementation.

      was (Author: adeneche):
The next step is to implement a tool that redistributes the instances over the partitions in order to get a uniform distribution of classes.
The tool I implemented uses the following simple algorithm:
{code}
input:
* data
* number of partitions p

create p output files, one for each partition
for each class c, currents[c] = the partition where the next tuple of class c will be put; currents[] is randomly initialized with integers in the range [0, p)

for each tuple do the following:
* put the tuple in the file corresponding to the partition currents[tuple.class]
* currents[tuple.class]++; if (currents[tuple.class] == p) currents[tuple.class] = 0;
end for
{code}

Computing the frequency distribution of the class attribute on the "uniform" data gives the following results:
||||label1||label2||label3||...||
|partition 0|98218|	3|	1|	0|	107202|	280789|	5|	27|	98|	1041|	1248|	2|	0|	220|	1|	1590|	0|	231|	1|	2|	102|	0|	1|
|partition 1|98220|	3|	1|	0|	107202|	280789|	5|	27|	98|	1041|	1248|	2|	0|	220|	1|	1589|	0|	231|	1|	2|	102|	1|	1|
|partition 2|98214|	3|	1|	0|	107202|	280789|	5|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	1|	231|	1|	2|	102|	1|	1|
|partition 3|98221|	3|	1|	0|	107201|	280789|	5|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	1|	231|	1|	2|	102|	0|	1|
|partition 4|98221|	3|	1|	0|	107201|	280788|	5|	26|	98|	1042|	1249|	2|	1|	221|	1|	1589|	1|	232|	1|	2|	102|	0|	1|
|partition 5|98218|	4|	1|	1|	107201|	280788|	5|	26|	97|	1042|	1248|	3|	1|	221|	1|	1589|	1|	232|	1|	2|	102|	0|	1|
|partition 6|98220|	3|	1|	0|	107202|	280788|	6|	26|	98|	1042|	1248|	2|	1|	221|	1|	1589|	0|	232|	1|	2|	102|	0|	1|
|partition 7|98219|	2|	1|	1|	107202|	280788|	6|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	0|	232|	0|	2|	102|	0|	1|
|partition 8|97541|	3|	0|	1|	107204|	281453|	6|	27|	98|	1041|	1248|	2|	1|	220|	2|	1589|	0|	232|	0|	2|	102|	0|	1|
|partition 9|89489|	3|	1|	0|	107200|	280125|	5|	27|	98|	1041|	1248|	2|	1|	220|	2|	1590|	0|	232|	0|	2|	102|	0|	1|
 
This time the classes are well distributed. 

Here is a quick comparison between the original dataset and the "uniform" dataset, building 10 trees over 10 partitions with the partial implementation:

||||Original||Uniform||
|Out of Bag Error|0.2|1.7E-4|

Although more tests are needed, these results are very encouraging.

PS:
This program generates p files that should be used as training data instead of the original data. Although my MapReduce partial implementation should handle datasets split over many files, it currently does not; for now I have to join the files into one single file before running the partial implementation.
  


[jira] Commented: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Deneche A. Hakim (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797544#action_12797544 ] 

Deneche A. Hakim commented on MAHOUT-216:
-----------------------------------------

Here are some results on a 5-slave EC2 cluster, using 100% of the KDD dataset:

 || Num Map Tasks || Num Trees || Build Time || oob error ||
 | 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
 | 10 | 100 | 0h 10m 5s 231 | 1.7E-4 |

The results look good; now I'll have to try the generated classifier on the KDD test data and see...

Some known issues (that I'll try to fix) are:
* the MapReduce implementations cannot handle datasets split over multiple files
* because a lot of work is done while the mappers are closing, I need to refresh some Hadoop counter, or the job gets canceled when trying to build a lot of trees (400)




[jira] Commented: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Deneche A. Hakim (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788902#action_12788902 ] 

Deneche A. Hakim commented on MAHOUT-216:
-----------------------------------------

The next step is to implement a tool that redistributes the instances over the partitions in order to get a uniform distribution of classes.
The tool I implemented uses the following simple algorithm:
{code}
input:
* data
* number of partitions p

create p output files, one for each partition
for each class c, currents[c] = the partition where the next tuple of class c will be put; currents[] is randomly initialized with integers in the range [0, p)

for each tuple do the following:
* put the tuple in the file corresponding to the partition currents[tuple.class]
* currents[tuple.class]++; if (currents[tuple.class] == p) currents[tuple.class] = 0;
end for
{code}

Computing the frequency distribution of the class attribute on the "uniform" data gives the following results:
||||label1||label2||label3||...||
|partition 0|98218|	3|	1|	0|	107202|	280789|	5|	27|	98|	1041|	1248|	2|	0|	220|	1|	1590|	0|	231|	1|	2|	102|	0|	1|
|partition 1|98220|	3|	1|	0|	107202|	280789|	5|	27|	98|	1041|	1248|	2|	0|	220|	1|	1589|	0|	231|	1|	2|	102|	1|	1|
|partition 2|98214|	3|	1|	0|	107202|	280789|	5|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	1|	231|	1|	2|	102|	1|	1|
|partition 3|98221|	3|	1|	0|	107201|	280789|	5|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	1|	231|	1|	2|	102|	0|	1|
|partition 4|98221|	3|	1|	0|	107201|	280788|	5|	26|	98|	1042|	1249|	2|	1|	221|	1|	1589|	1|	232|	1|	2|	102|	0|	1|
|partition 5|98218|	4|	1|	1|	107201|	280788|	5|	26|	97|	1042|	1248|	3|	1|	221|	1|	1589|	1|	232|	1|	2|	102|	0|	1|
|partition 6|98220|	3|	1|	0|	107202|	280788|	6|	26|	98|	1042|	1248|	2|	1|	221|	1|	1589|	0|	232|	1|	2|	102|	0|	1|
|partition 7|98219|	2|	1|	1|	107202|	280788|	6|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	0|	232|	0|	2|	102|	0|	1|
|partition 8|97541|	3|	0|	1|	107204|	281453|	6|	27|	98|	1041|	1248|	2|	1|	220|	2|	1589|	0|	232|	0|	2|	102|	0|	1|
|partition 9|89489|	3|	1|	0|	107200|	280125|	5|	27|	98|	1041|	1248|	2|	1|	220|	2|	1590|	0|	232|	0|	2|	102|	0|	1|
 
This time the classes are well distributed. 

Here is a quick comparison between the original dataset and the "uniform" dataset, building 10 trees over 10 partitions with the partial implementation:

||||Original||Uniform||
|Out of Bag Error|0.2|1.7E-4|

Although more tests are needed, these results are very encouraging.

PS:
This program generates p files that should be used as training data instead of the original data. Although my MapReduce partial implementation should handle datasets split over many files, it currently does not; for now I have to join the files into one single file before running the partial implementation.



[jira] Commented: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788961#action_12788961 ] 

Ted Dunning commented on MAHOUT-216:
------------------------------------


Couldn't you just re-sort the data using random keys? That leaves you with as many or as few files as you like, and allows you to do the split any way you like at learning time.
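Ted's suggestion amounts to a random shuffle: key every record with a random number, sort by the keys (in MapReduce, the framework's sort-and-shuffle does this), then cut the sorted stream wherever you like. A minimal in-memory sketch, with `shuffle_then_split` as an illustrative name:

```python
import random

def shuffle_then_split(records, num_splits, seed=None):
    """Re-sort records by random keys, then slice the result into
    num_splits near-equal contiguous parts; any other split (by size,
    by count, at learning time) works on the same shuffled stream."""
    rng = random.Random(seed)
    # each record gets one random key; sorting by it is a uniform shuffle
    shuffled = sorted(records, key=lambda _: rng.random())
    n = len(shuffled)
    bounds = [round(i * n / num_splits) for i in range(num_splits + 1)]
    return [shuffled[bounds[i]:bounds[i + 1]] for i in range(num_splits)]
```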



[jira] Commented: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Deneche A. Hakim (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788878#action_12788878 ] 

Deneche A. Hakim commented on MAHOUT-216:
-----------------------------------------

First of all, I implemented a simple mapreduce tool that computes the frequency distribution of the "class attribute" over the partitions.

Applying this tool over KDD 100% (the whole dataset) I got the following results:
|| ||label1||label2||label3||...||
|partition 0|380275|	3|	1|	1|	15|	112574|	53|	20|	99|	405|	1948|	1|	6|	1000|	1|	24|	1|	1040|	0|	0|	0|	0|	0|
|partition 1|182112|	2|	1|	1|	204800|	95158|	0|	20|	100|	2377|	5631|	16|	2|	1002|	11|	5365|	2|	1276|	6|	20|	0|	0|	0|
|partition 2|149880|	7|	6|	0|	206400|	124516|	0|	62|	198|	3609|	1023|	0|	0|	101|	0|	10489|	0|	0|	1|	0|	1020|	2|	7|
|partition 3|0|	0|	0|	0|	0|	489478|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|
|partition 4|0|	0|	0|	0|	0|	489475|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|
|partition 5|0|	0|	0|	0|	0|	489476|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|
|partition 6|52276|	3|	0|	1|	0|	434768|	0|	41|	99|	1027|	2556|	0|	0|	0|	0|	6|	0|	0|	0|	0|	0|	0|	0|
|partition 7|19629|	2|	1|	0|	449092|	28553|	0|	99|	383|	1191|	9|	1|	0|	0|	0|	4|	0|	0|	0|	0|	0|	0|	1|
|partition 8|0|	0|	0|	0|	0|	492696|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|
|partition 9|188609|	13|	0|	0|	211710|	51192|	0|	22|	100|	1804|	1314|	3|	0|	100|	0|	4|	1|	0|	0|	0|	0|	0|	2|

As you can see, the classes are not distributed uniformly; some partitions contain only one class!
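The frequency tool itself reduces to counting (partition, class) pairs — in mapreduce terms, the map emits the pair and the reduce sums the counts. An equivalent in-memory sketch (function name is mine, not from the Mahout patch):

```python
from collections import Counter

def class_distribution(partitions, get_label):
    """For each partition, count how many instances of each class it holds --
    the in-memory analogue of the mapreduce frequency tool."""
    return [Counter(get_label(inst) for inst in part) for part in partitions]
```

A partition whose Counter is missing a label (like partitions 3-5 and 8 in the table above) cannot contribute trees that recognize that class.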
