Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2014/02/19 11:39:44 UTC

[jira] [Comment Edited] (MAHOUT-1419) Random decision forest is excessively slow on numeric features

    [ https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905075#comment-13905075 ] 

Sean Owen edited comment on MAHOUT-1419 at 2/19/14 8:57 AM:
------------------------------------------------------------

The significant change is computing 'split points' rather than considering all values as split points.

For numeric features, this means taking percentiles, rather than every distinct value, which was incredibly slow for any data set with a numeric feature.
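As a rough sketch of the percentile idea (illustrative only; the class and method names here are not Mahout's actual API): sort the feature's values once, then take evenly spaced order statistics as the only candidate split points, so a node evaluates a handful of splits instead of one per distinct value.

```java
import java.util.Arrays;

public class PercentileSplits {

  // Return numSplits candidate split points taken at evenly spaced
  // percentiles of the sorted feature values.
  static double[] splitPoints(double[] values, int numSplits) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    double[] splits = new double[numSplits];
    for (int i = 0; i < numSplits; i++) {
      // Index of the (i+1)/(numSplits+1) percentile
      int idx = (int) ((long) (i + 1) * sorted.length / (numSplits + 1));
      splits[i] = sorted[Math.min(idx, sorted.length - 1)];
    }
    return splits;
  }

  public static void main(String[] args) {
    double[] values = new double[200];
    for (int i = 0; i < values.length; i++) {
      values[i] = i; // 200 distinct values
    }
    // Evaluate only 3 candidate splits instead of all 200 distinct values
    System.out.println(Arrays.toString(splitPoints(values, 3)));
    // prints [50.0, 100.0, 150.0]
  }
}
```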

Categorical split handling was also optimized -- it was pointlessly allocating a count array for every datum, rather than one for every distinct category value.
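A minimal sketch of the corrected counting scheme, assuming dense integer category and label codes (names here are illustrative, not Mahout's): allocate one counts array per distinct category value, filled in a single pass over the data.

```java
public class CategoricalCounts {

  // counts[category][label] = number of data points with that
  // category value and class label. One array per distinct category
  // value, not one per datum.
  static int[][] countPerCategory(int[] categories, int[] labels,
                                  int numCategories, int numLabels) {
    int[][] counts = new int[numCategories][numLabels];
    for (int i = 0; i < categories.length; i++) {
      counts[categories[i]][labels[i]]++;
    }
    return counts;
  }
}
```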

Handling of the count arrays and counting of occurrences was unified and simplified; there was no point in keeping them as instance members, since they were never reused.

(There are a few micro-optimizations, such as to the entropy method.)
(Also fixed an NPE in BuildForest.)
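For illustration, a typical micro-optimized entropy-over-counts method might look like the following (a hedged sketch, not Mahout's actual code): skip zero counts, divide by the total once, and convert to bits with a single final division.

```java
public class Entropy {

  // Entropy in bits of a class-count distribution.
  static double entropy(int[] counts, int total) {
    if (total == 0) {
      return 0.0;
    }
    double invTotal = 1.0 / total; // divide once, multiply per count
    double h = 0.0;
    for (int count : counts) {
      if (count > 0) { // zero counts contribute nothing: 0*log(0) -> 0
        double p = count * invTotal;
        h -= p * Math.log(p);
      }
    }
    return h / Math.log(2); // nats to bits in one division
  }
}
```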

The tests had to change as a result. The test for equivalence between OptIgSplit and DefaultIgSplit is no longer valid, since the two intentionally no longer behave the same way. VisualizerTest now produces different values, but I verified that the difference is superficial: the trees chosen before and after are equivalent, since the decision thresholds, while different, partition the data identically.

I end up observing a *50x* speedup with this change, although this is in a test that exercises only tree building, on a data set that would be maximally affected by this bottleneck.



> Random decision forest is excessively slow on numeric features
> --------------------------------------------------------------
>
>                 Key: MAHOUT-1419
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1419
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.7, 0.8, 0.9
>            Reporter: Sean Owen
>         Attachments: MAHOUT-1419.patch
>
>
> Follow-up to MAHOUT-1417. There's a customer running this and observing it take an unreasonably long time on about 2GB of data -- like, >24 hours when other RDF M/R implementations take 9 minutes. The difference is big enough to probably be considered a defect. MAHOUT-1417 got that down to about 5 hours. I am trying to further improve it.
> One key issue seems to be how splits are evaluated over numeric features. A split is tried for every distinct numeric value of the feature in the whole data set. Since these are floating point values, they could (and in the customer's case are) all distinct. 200K rows means 200K splits to evaluate every time a node is built on the feature.
> A better approach is to sample percentiles of the feature and evaluate only those as splits. Doing that really efficiently would require a substantial rewrite. However, some modest changes get much of the benefit, and appear to make it run about 3x faster -- on a data set that exhibits this problem, meaning one with numeric features whose values are generally distinct. That is not an exotic case.
> There are comparable but different problems with handling of categorical features, but that's for a different patch.
> I have a patch, but it changes behavior to some extent, since it evaluates only a sample of splits instead of every possible one. In particular, it makes the output of "OptIgSplit" no longer match that of "DefaultIgSplit", although I think the point is that "optimized" may mean making different split choices here, which could yield differing trees. So that test probably has to go.
> (Along the way I found a number of micro-optimizations in this part of the code that added up to maybe a 3% speedup. And fixed an NPE too.)
> I will propose a patch shortly with all of this for thoughts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)