Posted to user@mahout.apache.org by Yang Zhou <bi...@gmail.com> on 2012/11/01 14:29:01 UTC

The function of the parameter complemented in DecisionTreeBuilder

If I call the method setComplemented(boolean complemented) with the
parameter true, how does this affect the tree builder? Thanks for the help!

Re: The function of the parameter complemented in DecisionTreeBuilder

Posted by Yang Zhou <bi...@gmail.com>.
I have already read the code, but I still cannot get a clear idea of it.
Complemented is true by default, and it has an impact on how the tree
builder works.

The following is the code snippet from DecisionTreeBuilder.java

Collection<Double> subsetValues = null;
if (complemented) {
  // Remember the current values in a set, then switch to iterating
  // over the values that fullSet reports for the chosen attribute.
  subsetValues = Sets.newHashSet();
  for (double value : values) {
    subsetValues.add(value);
  }
  values = fullSet.values(best.getAttr());
}

int cnt = 0;
Data[] subsets = new Data[values.length];
for (int index = 0; index < values.length; index++) {
  // Skip values that were not among the remembered ones.
  if (complemented && !subsetValues.contains(values[index])) {
    continue;
  }
  subsets[index] = data.subset(Condition.equals(best.getAttr(), values[index]));
  // Count the subsets that hold at least minSplitNum entries.
  if (subsets[index].size() >= minSplitNum) {
    cnt++;
  }
}
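
To experiment with this loop outside Mahout, here is a minimal sketch in plain Java. It is not the Mahout code: the class name SplitCountSketch, the method countViableSubsets, and the use of a List<Double> in place of Mahout's Data class are all assumptions made for illustration. It only mirrors the counting logic of the snippet above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the loop above. The node data is just a list of
// category values for one attribute, not Mahout's Data class.
class SplitCountSketch {

  // Counts how many per-value subsets hold at least minSplitNum rows.
  static int countViableSubsets(double[] nodeValues, double[] fullSetValues,
                                List<Double> nodeData, int minSplitNum,
                                boolean complemented) {
    double[] values = nodeValues;
    Set<Double> subsetValues = null;
    if (complemented) {
      // Remember the values present at this node, then iterate over all values.
      subsetValues = new HashSet<>();
      for (double v : values) {
        subsetValues.add(v);
      }
      values = fullSetValues;
    }

    int cnt = 0;
    for (double value : values) {
      if (complemented && !subsetValues.contains(value)) {
        continue; // value exists in the full set but not at this node
      }
      int size = 0;
      for (double d : nodeData) {
        if (d == value) {
          size++;
        }
      }
      if (size >= minSplitNum) {
        cnt++;
      }
    }
    return cnt;
  }

  public static void main(String[] args) {
    List<Double> nodeData = Arrays.asList(0.0, 0.0, 1.0, 1.0, 1.0);
    double[] nodeValues = {0.0, 1.0};
    double[] fullSetValues = {0.0, 1.0, 2.0};
    System.out.println(countViableSubsets(nodeValues, fullSetValues, nodeData, 2, true));
    System.out.println(countViableSubsets(nodeValues, fullSetValues, nodeData, 2, false));
  }
}
```

In this sketch the value 2.0 exists only in the full set, so it is skipped when complemented is true, and both calls in main return the same count.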


Re: The function of the parameter complemented in DecisionTreeBuilder

Posted by Anca Leuca <an...@gmail.com>.
Ah, good that we reached an agreement. No problem, I quite enjoyed it.


Re: The function of the parameter complemented in DecisionTreeBuilder

Posted by Yang Zhou <bi...@gmail.com>.
Hi,

Sorry about those confusing words.

I did not mean that there is a bug. What I meant, which is the same as what
you said, is that the value of cnt at line 288 is the same whether
complemented is true or not. And complemented does affect how the leaves
are built.

Really appreciate your time!


Re: The function of the parameter complemented in DecisionTreeBuilder

Posted by Anca Leuca <an...@gmail.com>.
Hi,

> However, when complemented = true, the split is still based on the same
> possible values of C from the data that is passed to the method.


Yes. The split is indeed based on a subset of the data.


> As said by
> the code  from line 278 to line 280, if a value of C is contained in the
> entire dataset, but not the data that is passed to the method, the continue
> statement is executed. So those values of C that are not contained in the
> data passed to the method do not affect the method.
>

Not sure what you mean by 'affect the method'. I think the datapoints that
refer to values of C not contained in the data passed are not meant to
change the calculations.
Also, *continue* is executed in two places: in the loop at lines 277-285 and
in the loop at lines 303-317, under the same conditions. So technically I
don't think there's a bug there, although admittedly it's not a very
clean/obvious solution :).


> In a word, whether complemented is true or false, the result after
> executing the code from line 267 to line 285 is the same.
>

Again, I am not sure what you mean by 'result'. If you mean the variable
*subsets*, yes, that one will have the same value, regardless of
complemented. The interesting stuff, however, happens in lines 302-332,
where the 'complementing' leaves are being built.

That being said, I think the best approach would be to just give the tree
builder a test and see what it spits out, for a simple dataset that you can
eyeball. Or have a look at the unit tests (if any); they should also give a
clue about what was meant.

Anca



Re: The function of the parameter complemented in DecisionTreeBuilder

Posted by Yang Zhou <bi...@gmail.com>.
Hi Anca,

Thanks for replying; it corrects my understanding. The method only uses
the data passed to it to decide whether to split a node or not. And I
might have found a problem with the code. Please look at the code from
line 277 to line 285 of this file,
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/classifier/df/builder/DecisionTreeBuilder.java?av=f

I agree that when complemented = false, on that node we might only
branch on a subset of the possible values of C, namely those contained in
the data that is passed to the method.

However, when complemented = true, the split is still based on the same
possible values of C from the data that is passed to the method. As shown by
the code from line 278 to line 280, if a value of C is contained in the
entire dataset, but not in the data that is passed to the method, the
continue statement is executed. So those values of C that are not contained
in the data passed to the method do not affect the method.

In short, whether complemented is true or false, the result after
executing the code from line 267 to line 285 is the same.


Re: The function of the parameter complemented in DecisionTreeBuilder

Posted by Anca Leuca <an...@gmail.com>.
Hi Yang,

I think I understand it better now, as well. So this is what I think it
does:

First of all, I think it only affects the categorical node splits. It
works as follows in this scenario:
Let us consider a dataset D we want to build a decision tree from.
Let's say the tree has been partially built, and we've reached a
categorical attribute C that we want to split on.

As I understand it, when complemented = false, on that node we might only
branch on a subset of possible values of C.

When complemented = true, however, we will 'force' branching on all
possible values of C from the entire dataset, and replace the missing data
with leaves having a label computed from the parent data (line 307):

if (data.getDataset().isNumerical(data.getDataset().getLabelId())) {
  label = sum / data.size();
} else {
  label = data.majorityLabel(rng);
}


I hope this is correct and helps with understanding it better.
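
The behaviour described above can be sketched in plain Java. This is only my reading of it, not the Mahout implementation: the class ComplementedLeafSketch, the methods branch and majorityLabel, the string placeholders "subtree" and "leaf(...)", and the use of string categories are all assumptions for illustration. Ties in the majority vote go to the first label seen here, whereas the Mahout call takes an rng.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: which children a categorical split gets,
// with and without the complemented flag.
class ComplementedLeafSketch {

  // Most frequent label; ties go to the label that was seen first.
  static String majorityLabel(List<String> labels) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String label : labels) {
      counts.merge(label, 1, Integer::sum);
    }
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() > bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    return best;
  }

  // Each row of nodeRows is {categoryValue, label}.
  static Map<String, String> branch(List<String> fullSetValues,
                                    List<String[]> nodeRows,
                                    boolean complemented) {
    List<String> parentLabels = new ArrayList<>();
    for (String[] row : nodeRows) {
      parentLabels.add(row[1]);
    }
    String parentLabel = majorityLabel(parentLabels);

    Map<String, String> children = new LinkedHashMap<>();
    for (String value : fullSetValues) {
      boolean present = false;
      for (String[] row : nodeRows) {
        if (row[0].equals(value)) {
          present = true;
        }
      }
      if (present) {
        children.put(value, "subtree"); // would be built recursively
      } else if (complemented) {
        children.put(value, "leaf(" + parentLabel + ")"); // label from parent data
      }
    }
    return children;
  }

  public static void main(String[] args) {
    List<String> fullSetValues = Arrays.asList("red", "green", "blue");
    List<String[]> nodeRows = Arrays.asList(
        new String[] {"red", "yes"},
        new String[] {"red", "yes"},
        new String[] {"green", "no"});
    System.out.println(branch(fullSetValues, nodeRows, true));
    System.out.println(branch(fullSetValues, nodeRows, false));
  }
}
```

In this example "blue" never occurs in the node data, so it only gets a child (a leaf labelled with the parent's majority label) when complemented is true.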


Also, I found this <https://issues.apache.org/jira/browse/MAHOUT-840>;
it's the Jira task that introduced the DecisionTreeBuilder. Take a
look at the comments, maybe they'll help you as well.



Anca

Re: The function of the parameter complemented in DecisionTreeBuilder

Posted by Yang Zhou <bi...@gmail.com>.
Hi,

I read the code again and now understand how the parameter complemented
works.

Basically, it only affects how the method public Node build(Random rng,
Data data) works. When complemented is true, the method decides how to
split the node based on the data contained in the full data set, not on
the parameter data passed to the method. And it also affects the assignment
of labels for leaves.
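
As a rough illustration of the label assignment, here is a small sketch in plain Java. It assumes, based on the DecisionTreeBuilder snippet quoted elsewhere in this thread, that a numerical label becomes the mean of the parent's labels and a categorical label becomes the parent's majority label. The class LeafLabelSketch and the method leafLabel are hypothetical names, labels are plain doubles, the input is assumed to be non-empty, and ties go to the first label seen rather than being broken with an rng.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout.
class LeafLabelSketch {

  static double leafLabel(double[] parentLabels, boolean numericalLabel) {
    if (numericalLabel) {
      // Regression case: mean of the parent's labels.
      double sum = 0.0;
      for (double label : parentLabels) {
        sum += label;
      }
      return sum / parentLabels.length;
    }
    // Classification case: most frequent label code.
    Map<Double, Integer> counts = new HashMap<>();
    double best = parentLabels[0];
    for (double label : parentLabels) {
      int count = counts.merge(label, 1, Integer::sum);
      if (count > counts.get(best)) {
        best = label;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    double[] labels = {1.0, 1.0, 4.0};
    System.out.println(leafLabel(labels, true));  // mean of the labels
    System.out.println(leafLabel(labels, false)); // most frequent label
  }
}
```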


Re: The function of the parameter complemented in DecisionTreeBuilder

Posted by Anca Leuca <an...@gmail.com>.
Hi,

> If I call the method setComplemented(boolean complemented) with the
> parameter true, how does this affect the tree builder? Thanks for the help!
>

The source <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/classifier/df/builder/DecisionTreeBuilder.java#DecisionTreeBuilder.0complemented>
suggests that complemented is true by default, so I guess it wouldn't?

That being said, I don't know what complemented actually does. The code
might be useful to look at, or maybe someone more knowledgeable than me
could shed some light on this.

Anca