Posted to dev@mahout.apache.org by "Ikumasa Mukai (Created) (JIRA)" <ji...@apache.org> on 2012/01/05 18:01:39 UTC
[jira] [Created] (MAHOUT-943) Improve the way to make the split point on DF.
Improve the way to make the split point on DF.
----------------------------------------------
Key: MAHOUT-943
URL: https://issues.apache.org/jira/browse/MAHOUT-943
Project: Mahout
Issue Type: Improvement
Components: Classification
Reporter: Ikumasa Mukai
The numericalSplit() method in OptIgSplit uses the attribute value with the best information gain (IG) as the split point.
I think this is a little too strict; in some situations it would be better to use the average of the value with the best IG and the second value.
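A minimal sketch of the proposal, not the actual OptIgSplit code. It scans the candidate thresholds of one numerical attribute, picks the value with the best information gain, and also returns the average of that value and its neighbouring smaller distinct value. Reading "the second value" as that neighbour is an assumption, and the class and method names are hypothetical.

```java
import java.util.Arrays;

final class MidpointSplitSketch {

  /** Shannon entropy (in bits) of a class-count histogram. */
  static double entropy(int[] counts, int total) {
    double h = 0.0;
    for (int c : counts) {
      if (c > 0) {
        double p = (double) c / total;
        h -= p * Math.log(p) / Math.log(2.0);
      }
    }
    return h;
  }

  /** Information gain of splitting the rows into {x < threshold} and {x >= threshold}. */
  static double informationGain(double[] values, int[] labels, int numLabels, double threshold) {
    int[] all = new int[numLabels];
    int[] lo = new int[numLabels];
    int[] hi = new int[numLabels];
    int loSize = 0;
    for (int i = 0; i < values.length; i++) {
      all[labels[i]]++;
      if (values[i] < threshold) {
        lo[labels[i]]++;
        loSize++;
      } else {
        hi[labels[i]]++;
      }
    }
    int n = values.length;
    int hiSize = n - loSize;
    double after = 0.0;
    if (loSize > 0) {
      after += (double) loSize / n * entropy(lo, loSize);
    }
    if (hiSize > 0) {
      after += (double) hiSize / n * entropy(hi, hiSize);
    }
    return entropy(all, n) - after;
  }

  /**
   * Returns {bestValue, midpoint}: the attribute value with the best IG, and the
   * average of that value and the next smaller distinct value (assumed meaning
   * of "the second value").
   */
  static double[] bestSplit(double[] values, int[] labels, int numLabels) {
    double[] sorted = Arrays.stream(values).distinct().sorted().toArray();
    if (sorted.length < 2) {
      throw new IllegalArgumentException("need at least two distinct values");
    }
    int best = 1;
    double bestIg = -1.0;
    // The smallest value is skipped: splitting there leaves the low side empty.
    for (int i = 1; i < sorted.length; i++) {
      double ig = informationGain(values, labels, numLabels, sorted[i]);
      if (ig > bestIg) {
        bestIg = ig;
        best = i;
      }
    }
    return new double[] {sorted[best], (sorted[best - 1] + sorted[best]) / 2.0};
  }

  public static void main(String[] args) {
    double[] values = {1, 2, 3, 10, 11, 12};
    int[] labels = {0, 0, 0, 1, 1, 1};
    double[] r = bestSplit(values, labels, 2);
    System.out.println("best-IG value = " + r[0] + ", midpoint = " + r[1]);
  }
}
```

For this toy data the best-IG value is 10 and the midpoint is 6.5. Both thresholds separate the training rows identically, but the midpoint leaves more margin for unseen values that fall between 3 and 10.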
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-943) Improve the way to make the split point on DF.
Posted by "Ikumasa Mukai (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181295#comment-13181295 ]
Ikumasa Mukai commented on MAHOUT-943:
--------------------------------------
Thank you for your advice.
Yes, I have looked into how to implement this, and I agree that BuildForest should get an option for switching the implementation.
I will post a patch here once it is done!
> Improve the way to make the split point on DF.
> ----------------------------------------------
>
> Key: MAHOUT-943
> URL: https://issues.apache.org/jira/browse/MAHOUT-943
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Reporter: Ikumasa Mukai
> Labels: DecisionForest
[jira] [Updated] (MAHOUT-943) Improve the way to make the split point on DF.
Posted by "Ikumasa Mukai (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ikumasa Mukai updated MAHOUT-943:
---------------------------------
Status: Patch Available (was: Open)
[jira] [Commented] (MAHOUT-943) Improve the way to make the split point on DF.
Posted by "Deneche A. Hakim (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180569#comment-13180569 ]
Deneche A. Hakim commented on MAHOUT-943:
-----------------------------------------
You can inherit from IgSplit and provide your own implementation, but we'll need a way to tell BuildForest which implementation to use.
[jira] [Commented] (MAHOUT-943) Improve the way to make the split point on DF.
Posted by "Wang Yue (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184004#comment-13184004 ]
Wang Yue commented on MAHOUT-943:
---------------------------------
Hi Mukai,
Thanks for your efforts in extending the RF to regression. However, I have a question about your implementation in Regressionsplit.java. The variance method is:
"
private static double variance(double[] s, double[] ss, double[] dataSize) {
  double var = 0;
  for (int i = 0; i < s.length; i++) {
    if (dataSize[i] > 0) {
      var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
    }
  }
  return var;
}
"
I would expect the variance to be something like
  var += ss[i] / dataSize[i] - ((s[i] * s[i]) / (dataSize[i] * dataSize[i]));
Please correct me if I am wrong. Thanks.
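To make the difference between the two expressions concrete, here is a small sketch for a single partition with running sum s, sum of squares ss, and count n. The first expression is the sum of squared deviations from the mean; dividing it by n gives the population variance, which is the second expression. The class and method names are illustrative and do not come from the patch.

```java
final class VarianceFormulas {

  /** ss - s^2 / n: the sum of squared deviations from the mean. */
  static double sumOfSquaredDeviations(double s, double ss, double n) {
    return ss - (s * s) / n;
  }

  /** ss / n - (s / n)^2: the population variance. */
  static double populationVariance(double s, double ss, double n) {
    return ss / n - (s * s) / (n * n);
  }

  public static void main(String[] args) {
    double[] xs = {2.0, 4.0, 6.0};
    double s = 0.0;
    double ss = 0.0;
    for (double x : xs) {
      s += x;
      ss += x * x;
    }
    double n = xs.length;
    System.out.println("sum of squared deviations = " + sumOfSquaredDeviations(s, ss, n));
    System.out.println("population variance       = " + populationVariance(s, ss, n));
  }
}
```

Summing the first expression over all partitions is the same as summing each partition's variance weighted by its size, so the two forms rank candidate splits differently only when the unweighted sum of variances is what is wanted.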
[jira] [Commented] (MAHOUT-943) Improve the way to make the split point on DF.
Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184024#comment-13184024 ]
Ted Dunning commented on MAHOUT-943:
------------------------------------
Also, that isn't a particularly good way to compute variance in the first place.
It is better to use Welford's method. Better still, use something like the OnlineSummarizer.
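For reference, a minimal sketch of Welford's method in plain Java. It updates the mean and the sum of squared deviations one value at a time, which avoids the cancellation that ss - s*s/n suffers when the mean is large relative to the spread. The class name and its methods are hypothetical; this is not the OnlineSummarizer API.

```java
final class WelfordVariance {
  private long n;
  private double mean;
  private double m2; // running sum of squared deviations from the current mean

  /** Folds one observation into the running statistics. */
  void add(double x) {
    n++;
    double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean); // uses the mean both before and after the update
  }

  long count() {
    return n;
  }

  double mean() {
    return mean;
  }

  /** Variance with divisor n; 0 when no values have been added. */
  double populationVariance() {
    return n > 0 ? m2 / n : 0.0;
  }

  /** Variance with divisor n - 1; 0 when fewer than two values have been added. */
  double sampleVariance() {
    return n > 1 ? m2 / (n - 1) : 0.0;
  }

  public static void main(String[] args) {
    WelfordVariance w = new WelfordVariance();
    // Large offset, small spread: a hard case for the sum-of-squares formula.
    for (double x : new double[] {1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16}) {
      w.add(x);
    }
    System.out.println("mean = " + w.mean() + ", sample variance = " + w.sampleVariance());
  }
}
```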
[jira] [Updated] (MAHOUT-943) Improve the way to make the split point on DF.
Posted by "Ikumasa Mukai (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ikumasa Mukai updated MAHOUT-943:
---------------------------------
Attachment: MAHOUT-943.patch
I made a patch.
Following Deneche-san's advice, I added a mechanism for configuring the TreeBuilder through an XML file.
{noformat}
<?xml version="1.0"?>
<configuration>
  <treeBuilder class="org.apache.mahout.classifier.df.builder.DecisionTreeBuilder">
    <igSplit class="org.apache.mahout.classifier.df.split.ClassificationSplit"/>
    <m>5</m>
  </treeBuilder>
</configuration>
{noformat}
The ClassificationSplit class is a sample splitter that uses the average value as the split point.
{noformat}
./hadoop jar $MAHOUT_HOME/mahout-examples-0.6-SNAPSHOT-job.jar \
org.apache.mahout.classifier.df.mapreduce.BuildForest \
-Dmapred.max.split.size=1874231 \
-d $KDD_DATA/KDDTrain.data \
-ds $KDD_DATA/KDDTrain+.info \
-c $MAHOUT_HOME/conf/df-config.xml \
-p -t 100 -o $KDD_DATA/model
{noformat}
I added a "-c" parameter to BuildForest. This parameter should point to the XML configuration file.
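As a rough illustration of how a file in the format above could be read, here is a sketch that uses only the JDK's DOM parser. It is not the code from the patch: the class DfConfigSketch and its fields are hypothetical, and it only extracts the class names and the m value without instantiating anything.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DfConfigSketch {
  final String treeBuilderClass;
  final String igSplitClass;
  final int m;

  DfConfigSketch(String treeBuilderClass, String igSplitClass, int m) {
    this.treeBuilderClass = treeBuilderClass;
    this.igSplitClass = igSplitClass;
    this.m = m;
  }

  /** Parses the configuration XML; malformed input is reported as IllegalArgumentException. */
  static DfConfigSketch parse(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      Element builder = (Element) doc.getElementsByTagName("treeBuilder").item(0);
      Element split = (Element) builder.getElementsByTagName("igSplit").item(0);
      String m = builder.getElementsByTagName("m").item(0).getTextContent().trim();
      return new DfConfigSketch(
          builder.getAttribute("class"), split.getAttribute("class"), Integer.parseInt(m));
    } catch (Exception e) {
      throw new IllegalArgumentException("invalid decision forest configuration", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<?xml version=\"1.0\"?>"
        + "<configuration>"
        + "<treeBuilder class=\"org.apache.mahout.classifier.df.builder.DecisionTreeBuilder\">"
        + "<igSplit class=\"org.apache.mahout.classifier.df.split.ClassificationSplit\"/>"
        + "<m>5</m>"
        + "</treeBuilder>"
        + "</configuration>";
    DfConfigSketch config = parse(xml);
    // A real loader could go on to create the splitter reflectively, for example
    // with Class.forName(config.igSplitClass); that step is left out here.
    System.out.println(config.treeBuilderClass + " / " + config.igSplitClass + " / m=" + config.m);
  }
}
```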
[jira] [Commented] (MAHOUT-943) Improve the way to make the split point on DF.
Posted by "Ikumasa Mukai (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186179#comment-13186179 ]
Ikumasa Mukai commented on MAHOUT-943:
--------------------------------------
I posted a patch for Regressionsplit.java to MAHOUT-945,
because this issue (MAHOUT-943) is about the classification method.
[jira] [Commented] (MAHOUT-943) Improve the way to make the split point on DF.
Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184032#comment-13184032 ]
Sean Owen commented on MAHOUT-943:
----------------------------------
Or use RunningAverageAndStdDev, which does this too.