Posted to dev@mahout.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2009/12/10 20:12:29 UTC

[jira] Created: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data
-----------------------------------------------------------------------------------------------

                 Key: MAHOUT-216
                 URL: https://issues.apache.org/jira/browse/MAHOUT-216
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
            Reporter: Deneche A. Hakim
            Assignee: Deneche A. Hakim


The poor results of the partial decision forest implementation may be explained by the particular distribution of the partitioned data. For example, if a partition does not contain any instance of a given class, the decision trees built using this partition won't be able to classify this class.
According to [CHAN, 95]:

{quote}
Random Selection of the partitioned data sets with a uniform distribution of classes is perhaps the most sensible solution. Here we may attempt to maintain the same frequency distribution over the ''class attribute" so that each partition represents a good but a smaller model of the entire training set
{quote}

[CHAN, 95]: Philip K. Chan, "On the Accuracy of Meta-learning for Scalable Data Mining" 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Issue Comment Edited: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by Ted Dunning <te...@gmail.com>.
Nice work Deneche.

On Thu, Dec 10, 2009 at 12:34 PM, Deneche A. Hakim (JIRA)
<ji...@apache.org>wrote:



-- 
Ted Dunning, CTO
DeepDyve

[jira] Updated: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-216:
-----------------------------

        Fix Version/s: 0.3
    Affects Version/s: 0.2



[jira] Resolved: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Deneche A. Hakim (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deneche A. Hakim resolved MAHOUT-216.
-------------------------------------

    Resolution: Fixed

Done.



[jira] Issue Comment Edited: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Deneche A. Hakim (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797544#action_12797544 ] 

Deneche A. Hakim edited comment on MAHOUT-216 at 1/7/10 7:30 AM:
-----------------------------------------------------------------

Here are some results on a 5-slave EC2 cluster, using 100% of the KDD dataset:

 || Num Map Tasks || Num Trees || Build Time || oob error ||
 | 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
 | 10 | 100 | 0h 10m 5s 231 | 1.2E-4 |

The results look good; now I'll have to try the generated classifier on the KDD test data and see...

Some known issues (that I'll try to fix) are:
* the MapReduce implementations cannot handle datasets split over multiple files
* because a lot of work is done while the mappers are closing, I need to refresh some Hadoop counter, or the job gets canceled when trying to build a lot of trees (400)


      was (Author: adeneche):
Here are some results on a 5-slave EC2 cluster, using 100% of the KDD dataset:

 || Num Map Tasks || Num Trees || Build Time || oob error ||
 | 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
 | 10 | 100 | 0h 10m 5s 231 | 1.7E-4 |

The results look good; now I'll have to try the generated classifier on the KDD test data and see...

Some known issues (that I'll try to fix) are:
* the MapReduce implementations cannot handle datasets split over multiple files
* because a lot of work is done while the mappers are closing, I need to refresh some Hadoop counter, or the job gets canceled when trying to build a lot of trees (400)

  


[jira] Issue Comment Edited: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Deneche A. Hakim (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788902#action_12788902 ] 

Deneche A. Hakim edited comment on MAHOUT-216 at 12/10/09 8:32 PM:
-------------------------------------------------------------------

The next step is to implement a tool that redistributes the instances over the partitions in order to get a uniform distribution of classes.
The tool I implemented uses the following simple algorithm:
{code}
input:
* data
* number of partitions p

create p output files, one for each partition
for each class c, currents[c] = the partition where the next tuple of class c will be put; currents[] is randomly initialized with integers in the range [0, p)

for each tuple do the following:
* put the tuple in the file corresponding to the partition currents[tuple.class]
* currents[tuple.class]++; if (currents[tuple.class] == p) currents[tuple.class] = 0;
end for
{code}
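For illustration, the algorithm above can be sketched in Python (a minimal in-memory sketch: lists stand in for the p output files, and the name `partition_uniformly` is mine, not the actual tool's):

```python
import random

def partition_uniformly(tuples, p, seed=None):
    """Deal each class's instances out round-robin across p partitions,
    so every partition gets a near-uniform share of every class.

    `tuples` is a sequence of (features, label) pairs; the real tool
    writes p output files, here we return p lists instead."""
    rng = random.Random(seed)
    partitions = [[] for _ in range(p)]
    currents = {}  # class label -> partition that receives its next tuple
    for t in tuples:
        label = t[1]
        if label not in currents:
            # randomly initialize in [0, p) so classes start staggered
            currents[label] = rng.randrange(p)
        partitions[currents[label]].append(t)
        currents[label] = (currents[label] + 1) % p
    return partitions
```

Because each class is dealt out round-robin from a random starting partition, its counts across partitions can differ by at most one, which matches the near-equal per-class frequencies reported below.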

Computing the frequency distribution of the class attribute on the "uniform" data gives the following results:
||||label1||label2||label3||...||
|partition 0|98218|	3|	1|	0|	107202|	280789|	5|	27|	98|	1041|	1248|	2|	0|	220|	1|	1590|	0|	231|	1|	2|	102|	0|	1|
|partition 1|98220|	3|	1|	0|	107202|	280789|	5|	27|	98|	1041|	1248|	2|	0|	220|	1|	1589|	0|	231|	1|	2|	102|	1|	1|
|partition 2|98214|	3|	1|	0|	107202|	280789|	5|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	1|	231|	1|	2|	102|	1|	1|
|partition 3|98221|	3|	1|	0|	107201|	280789|	5|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	1|	231|	1|	2|	102|	0|	1|
|partition 4|98221|	3|	1|	0|	107201|	280788|	5|	26|	98|	1042|	1249|	2|	1|	221|	1|	1589|	1|	232|	1|	2|	102|	0|	1|
|partition 5|98218|	4|	1|	1|	107201|	280788|	5|	26|	97|	1042|	1248|	3|	1|	221|	1|	1589|	1|	232|	1|	2|	102|	0|	1|
|partition 6|98220|	3|	1|	0|	107202|	280788|	6|	26|	98|	1042|	1248|	2|	1|	221|	1|	1589|	0|	232|	1|	2|	102|	0|	1|
|partition 7|98219|	2|	1|	1|	107202|	280788|	6|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	0|	232|	0|	2|	102|	0|	1|
|partition 8|97541|	3|	0|	1|	107204|	281453|	6|	27|	98|	1041|	1248|	2|	1|	220|	2|	1589|	0|	232|	0|	2|	102|	0|	1|
|partition 9|89489|	3|	1|	0|	107200|	280125|	5|	27|	98|	1041|	1248|	2|	1|	220|	2|	1590|	0|	232|	0|	2|	102|	0|	1|
 
This time the classes are well distributed. 

Here is a quick comparison between the original dataset and the "uniform" dataset, building 10 trees over 10 partitions with the partial implementation: on the original data the out-of-bag error was 0.2, while on the "uniform" dataset it is 1.7E-4.

Although more tests are needed, these results are very encouraging.

PS:
This program generates p files that should be used as training data instead of the original data. Although my MapReduce partial implementation should handle datasets split over many files, it currently does not; for now I have to join the files into one single file before running the partial implementation.

      was (Author: adeneche):
The next step is to implement a tool that redistributes the instances over the partitions in order to get a uniform distribution of classes.
The tool I implemented uses the following simple algorithm:
{code}
input:
* data
* number of partitions p

create p output files, one for each partition
for each class c, currents[c] = the partition where the next tuple of class c will be put; currents[] is randomly initialized with integers in the range [0, p)

for each tuple do the following:
* put the tuple in the file corresponding to the partition currents[tuple.class]
* currents[tuple.class]++; if (currents[tuple.class] == p) currents[tuple.class] = 0;
end for
{code}

Computing the frequency distribution of the class attribute on the "uniform" data gives the following results:
||||label1||label2||label3||...||
|partition 0|98218|	3|	1|	0|	107202|	280789|	5|	27|	98|	1041|	1248|	2|	0|	220|	1|	1590|	0|	231|	1|	2|	102|	0|	1|
|partition 1|98220|	3|	1|	0|	107202|	280789|	5|	27|	98|	1041|	1248|	2|	0|	220|	1|	1589|	0|	231|	1|	2|	102|	1|	1|
|partition 2|98214|	3|	1|	0|	107202|	280789|	5|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	1|	231|	1|	2|	102|	1|	1|
|partition 3|98221|	3|	1|	0|	107201|	280789|	5|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	1|	231|	1|	2|	102|	0|	1|
|partition 4|98221|	3|	1|	0|	107201|	280788|	5|	26|	98|	1042|	1249|	2|	1|	221|	1|	1589|	1|	232|	1|	2|	102|	0|	1|
|partition 5|98218|	4|	1|	1|	107201|	280788|	5|	26|	97|	1042|	1248|	3|	1|	221|	1|	1589|	1|	232|	1|	2|	102|	0|	1|
|partition 6|98220|	3|	1|	0|	107202|	280788|	6|	26|	98|	1042|	1248|	2|	1|	221|	1|	1589|	0|	232|	1|	2|	102|	0|	1|
|partition 7|98219|	2|	1|	1|	107202|	280788|	6|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	0|	232|	0|	2|	102|	0|	1|
|partition 8|97541|	3|	0|	1|	107204|	281453|	6|	27|	98|	1041|	1248|	2|	1|	220|	2|	1589|	0|	232|	0|	2|	102|	0|	1|
|partition 9|89489|	3|	1|	0|	107200|	280125|	5|	27|	98|	1041|	1248|	2|	1|	220|	2|	1590|	0|	232|	0|	2|	102|	0|	1|
 
This time the classes are well distributed. 

Here is a quick comparison between the original dataset and the "uniform" dataset, building 10 trees over 10 partitions with the partial implementation:

||||Original||Uniform||
|Out of Bag Error|0.2|1.7E-4|

Although more tests are needed, these results are very encouraging.

PS:
This program generates p files that should be used as training data instead of the original data. Although my MapReduce partial implementation should handle datasets split over many files, it currently does not; for now I have to join the files into one single file before running the partial implementation.
  


[jira] Commented: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Deneche A. Hakim (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797544#action_12797544 ] 

Deneche A. Hakim commented on MAHOUT-216:
-----------------------------------------

Here are some results on a 5-slave EC2 cluster, using 100% of the KDD dataset:

 || Num Map Tasks || Num Trees || Build Time || oob error ||
 | 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
 | 10 | 100 | 0h 10m 5s 231 | 1.7E-4 |

The results look good; now I'll have to try the generated classifier on the KDD test data and see...

Some known issues (that I'll try to fix) are:
* the MapReduce implementations cannot handle datasets split over multiple files
* because a lot of work is done while the mappers are closing, I need to refresh some Hadoop counter, or the job gets canceled when trying to build a lot of trees (400)




[jira] Commented: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Deneche A. Hakim (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788902#action_12788902 ] 

Deneche A. Hakim commented on MAHOUT-216:
-----------------------------------------

The next step is to implement a tool that redistributes the instances over the partitions in order to get a uniform distribution of classes.
The tool I implemented uses the following simple algorithm:
{code}
input:
* data
* number of partitions p

create p output files, one for each partition
for each class c, currents[c] = the partition where the next tuple of class c will be put; currents[] is randomly initialized with integers in the range [0, p)

for each tuple do the following:
* put the tuple in the file corresponding to the partition currents[tuple.class]
* currents[tuple.class]++; if (currents[tuple.class] == p) currents[tuple.class] = 0;
end for
{code}

Computing the frequency distribution of the class attribute on the "uniform" data gives the following results:
||||label1||label2||label3||...||
|partition 0|98218|	3|	1|	0|	107202|	280789|	5|	27|	98|	1041|	1248|	2|	0|	220|	1|	1590|	0|	231|	1|	2|	102|	0|	1|
|partition 1|98220|	3|	1|	0|	107202|	280789|	5|	27|	98|	1041|	1248|	2|	0|	220|	1|	1589|	0|	231|	1|	2|	102|	1|	1|
|partition 2|98214|	3|	1|	0|	107202|	280789|	5|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	1|	231|	1|	2|	102|	1|	1|
|partition 3|98221|	3|	1|	0|	107201|	280789|	5|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	1|	231|	1|	2|	102|	0|	1|
|partition 4|98221|	3|	1|	0|	107201|	280788|	5|	26|	98|	1042|	1249|	2|	1|	221|	1|	1589|	1|	232|	1|	2|	102|	0|	1|
|partition 5|98218|	4|	1|	1|	107201|	280788|	5|	26|	97|	1042|	1248|	3|	1|	221|	1|	1589|	1|	232|	1|	2|	102|	0|	1|
|partition 6|98220|	3|	1|	0|	107202|	280788|	6|	26|	98|	1042|	1248|	2|	1|	221|	1|	1589|	0|	232|	1|	2|	102|	0|	1|
|partition 7|98219|	2|	1|	1|	107202|	280788|	6|	26|	98|	1041|	1248|	2|	1|	220|	1|	1589|	0|	232|	0|	2|	102|	0|	1|
|partition 8|97541|	3|	0|	1|	107204|	281453|	6|	27|	98|	1041|	1248|	2|	1|	220|	2|	1589|	0|	232|	0|	2|	102|	0|	1|
|partition 9|89489|	3|	1|	0|	107200|	280125|	5|	27|	98|	1041|	1248|	2|	1|	220|	2|	1590|	0|	232|	0|	2|	102|	0|	1|
 
This time the classes are well distributed. 

Here is a quick comparison between the original dataset and the "uniform" dataset, building 10 trees over 10 partitions with the partial implementation:

||||Original||Uniform||
|Out of Bag Error|0.2|1.7E-4|

Although more tests are needed, these results are very encouraging.

PS:
This program generates p files that should be used as training data instead of the original data. Although my MapReduce partial implementation should handle datasets split over many files, it currently does not; for now I have to join the files into one single file before running the partial implementation.



[jira] Commented: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788961#action_12788961 ] 

Ted Dunning commented on MAHOUT-216:
------------------------------------


Couldn't you just re-sort the data using random keys? That leaves you with as many or as few files as you like, and allows you to do the split any way you like at learning time.
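Ted's suggestion amounts to a random shuffle: key every record with a random number, sort by the keys (in MapReduce, the framework's sort-and-shuffle does this), then cut the sorted stream wherever you like. A minimal in-memory sketch, with `shuffle_then_split` as an illustrative name:

```python
import random

def shuffle_then_split(records, num_splits, seed=None):
    """Re-sort records by random keys, then slice the result into
    num_splits near-equal contiguous parts; any other split (by size,
    by count, at learning time) works on the same shuffled stream."""
    rng = random.Random(seed)
    # each record gets one random key; sorting by it is a uniform shuffle
    shuffled = sorted(records, key=lambda _: rng.random())
    n = len(shuffled)
    bounds = [round(i * n / num_splits) for i in range(num_splits + 1)]
    return [shuffled[bounds[i]:bounds[i + 1]] for i in range(num_splits)]
```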



[jira] Commented: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

Posted by "Deneche A. Hakim (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788878#action_12788878 ] 

Deneche A. Hakim commented on MAHOUT-216:
-----------------------------------------

First of all, I implemented a simple mapreduce tool that computes the frequency distribution of the "class attribute" over the partitions.

Applying this tool over KDD 100% (the whole dataset) I got the following results:
|| ||label1||label2||label3||...||
|partition 0|380275|	3|	1|	1|	15|	112574|	53|	20|	99|	405|	1948|	1|	6|	1000|	1|	24|	1|	1040|	0|	0|	0|	0|	0|
|partition 1|182112|	2|	1|	1|	204800|	95158|	0|	20|	100|	2377|	5631|	16|	2|	1002|	11|	5365|	2|	1276|	6|	20|	0|	0|	0|
|partition 2|149880|	7|	6|	0|	206400|	124516|	0|	62|	198|	3609|	1023|	0|	0|	101|	0|	10489|	0|	0|	1|	0|	1020|	2|	7|
|partition 3|0|	0|	0|	0|	0|	489478|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|
|partition 4|0|	0|	0|	0|	0|	489475|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|
|partition 5|0|	0|	0|	0|	0|	489476|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|
|partition 6|52276|	3|	0|	1|	0|	434768|	0|	41|	99|	1027|	2556|	0|	0|	0|	0|	6|	0|	0|	0|	0|	0|	0|	0|
|partition 7|19629|	2|	1|	0|	449092|	28553|	0|	99|	383|	1191|	9|	1|	0|	0|	0|	4|	0|	0|	0|	0|	0|	0|	1|
|partition 8|0|	0|	0|	0|	0|	492696|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|	0|
|partition 9|188609|	13|	0|	0|	211710|	51192|	0|	22|	100|	1804|	1314|	3|	0|	100|	0|	4|	1|	0|	0|	0|	0|	0|	2|

As you can see, the classes are not distributed uniformly; some partitions contain only one class!
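The frequency tool itself reduces to counting (partition, class) pairs — in mapreduce terms, the map emits the pair and the reduce sums the counts. An equivalent in-memory sketch (function name is mine, not from the Mahout patch):

```python
from collections import Counter

def class_distribution(partitions, get_label):
    """For each partition, count how many instances of each class it holds --
    the in-memory analogue of the mapreduce frequency tool."""
    return [Counter(get_label(inst) for inst in part) for part in partitions]
```

A partition whose Counter is missing a label (like partitions 3-5 and 8 in the table above) cannot contribute trees that recognize that class.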
