You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Ted Dunning <td...@apache.org> on 2010/10/06 23:22:44 UTC

Fwd: default tree builder

Andrey has a question about the Random Forest code.  (hurray for him for
being the first outsider to beat on it)

This looks like a simple and fairly obvious fix, but I can't afford enough
attention to verify.

Deneche?

---------- Forwarded message ----------
From: Andrey Gusev <an...@gmail.com>
Date: Wed, Oct 6, 2010 at 1:41 PM
Subject: default tree builder
To: Ted Dunning <td...@apache.org>


Hi Ted,

Unrelated to KCCA discussion. I have some pretty positive results with
bagging of DefaultTreeBuilder-id3 like implementation. However, there is
seems to be an issue with building possible running into infinite recursion,
i.e. stack overflow. Basically when you have few features, it is possible
that all results left to split can not be split according to any attribute
so in my cases line 107 of DefaultTreeBuilder keeps being called with
identical set. More precisely loSubset is empty and hiSubset contain few
instances which can not be split. I modified this into LimitDepthTreeBuilder
with something like:


    public Node build(Random rng, Data data) {
        return inner_build(rng, data, 0);
    }

    public Node inner_build(Random rng, Data data, int depth) {

        if (selected == null) {
            selected = new boolean[data.getDataset().nbAttributes()];
        }

        if (data.isEmpty()) {
            return new Leaf(-1);
        }
        if (isIdentical(data)) {
            return new Leaf(data.majorityLabel(rng));
        }
        if (data.identicalLabel()) {
            return new Leaf(data.get(0).label);
        }
        // SFDC change to prevent stack overflow
        if (depth > MAX_DEPTH) {
            return new Leaf(data.majorityLabel(rng));
        }
...

}

I know mahout-0.4 is pretty close to freeze (or past it). But this is simple
enough change and with high enough MAX_DEPTH (say 100) should not have much
functional impact unless there is actually infinite recursion. Or it could
default to not use it but have a setter for that. Let me know what you
think.

Thanks,

Andrey

Re: default tree builder

Posted by deneche abdelhakim <ad...@gmail.com>.
> Does this refer to the random forest stuff?

Yes, and the DefaultTreeBuilder is used by all implementations
(sequential, InMem and Partial).

> Andrey has a question about the Random Forest code.  (hurray for him for
> being the first outsider to beat on it)

Hurray =D

I will make the fix and create a patch as soon as possible.

I wonder if Andrey good give us the data that lead to this Bug. Just
to make sure it's not another hidden bug that's causing the infinite
recursion.

On Wed, Oct 6, 2010 at 10:22 PM, Ted Dunning <td...@apache.org> wrote:
> Andrey has a question about the Random Forest code.  (hurray for him for
> being the first outsider to beat on it)
>
> This looks like a simple and fairly obvious fix, but I can't afford enough
> attention to verify.
>
> Deneche?
>
> ---------- Forwarded message ----------
> From: Andrey Gusev <an...@gmail.com>
> Date: Wed, Oct 6, 2010 at 1:41 PM
> Subject: default tree builder
> To: Ted Dunning <td...@apache.org>
>
>
> Hi Ted,
>
> Unrelated to KCCA discussion. I have some pretty positive results with
> bagging of DefaultTreeBuilder-id3 like implementation. However, there is
> seems to be an issue with building possible running into infinite recursion,
> i.e. stack overflow. Basically when you have few features, it is possible
> that all results left to split can not be split according to any attribute
> so in my cases line 107 of DefaultTreeBuilder keeps being called with
> identical set. More precisely loSubset is empty and hiSubset contain few
> instances which can not be split. I modified this into LimitDepthTreeBuilder
> with something like:
>
>
>    public Node build(Random rng, Data data) {
>        return inner_build(rng, data, 0);
>    }
>
>    public Node inner_build(Random rng, Data data, int depth) {
>
>        if (selected == null) {
>            selected = new boolean[data.getDataset().nbAttributes()];
>        }
>
>        if (data.isEmpty()) {
>            return new Leaf(-1);
>        }
>        if (isIdentical(data)) {
>            return new Leaf(data.majorityLabel(rng));
>        }
>        if (data.identicalLabel()) {
>            return new Leaf(data.get(0).label);
>        }
>        // SFDC change to prevent stack overflow
>        if (depth > MAX_DEPTH) {
>            return new Leaf(data.majorityLabel(rng));
>        }
> ...
>
> }
>
> I know mahout-0.4 is pretty close to freeze (or past it). But this is simple
> enough change and with high enough MAX_DEPTH (say 100) should not have much
> functional impact unless there is actually infinite recursion. Or it could
> default to not use it but have a setter for that. Let me know what you
> think.
>
> Thanks,
>
> Andrey
>