Posted to dev@spark.apache.org by Sean Owen <so...@cloudera.com> on 2014/10/12 20:50:08 UTC

Decision forests don't work with non-trivial categorical features

I'm having trouble getting decision forests to work with categorical
features. I have a dataset with a categorical feature with 40 values.
It seems to be treated as a continuous/numeric value by the
implementation.

Digging deeper, I see there is some logic in the code indicating that a
categorical feature with N values does not work unless the number of
bins is at least 2 * (2^(N-1) - 1). I understand this as the naive
brute-force condition, wherein the decision tree will test every
possible split of the category set.
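
For a sense of scale, here is a quick, self-contained sketch of that
requirement (the formula follows the code quoted later in this thread; the
helper name is just for illustration):

// Bins needed to treat an N-category feature as "unordered" (all subsets
// of categories considered): 2 * (2^(N-1) - 1).
def requiredBins(numCategories: Int): BigInt =
  BigInt(2) * (BigInt(2).pow(numCategories - 1) - 1)

Seq(3, 5, 10, 20, 40).foreach { n =>
  println(s"$n categories -> ${requiredBins(n)} bins")
}
// 3 categories  -> 6 bins
// 5 categories  -> 30 bins
// 10 categories -> 1022 bins
// 20 categories -> 1048574 bins
// 40 categories -> 1099511627774 bins (about 1.1 trillion)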

However, this becomes unusable quickly, since the number of bins should be
tens or hundreds at most, so this requirement rules out categorical
features with more than 10 or so values. But, of course, it's not unusual
to have categorical features with high cardinality; in fact it's quite
common.

There are some reasonably good heuristics for selecting 'bins' over
categorical features when the number of bins is far smaller than the
complete, exhaustive set of splits.

Before I open a JIRA or go further: does anyone know what I'm referring
to, or am I mistaken? Is this a real limitation, and is it worth pursuing
these heuristics? I can't figure out how to proceed with decision forests
in MLlib otherwise.
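
For reference, this is roughly the kind of call involved (a sketch against
the MLlib RandomForest API; the feature index, arity, and parameter values
here are only illustrative):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD

// data: RDD[LabeledPoint] where feature 4 is categorical, encoded 0.0 to 39.0
def train(data: RDD[LabeledPoint]) =
  RandomForest.trainClassifier(
    data,
    2,             // numClasses
    Map(4 -> 40),  // categoricalFeaturesInfo: feature 4 has 40 categories
    100,           // numTrees
    "auto",        // featureSubsetStrategy
    "gini",        // impurity
    10,            // maxDepth
    300,           // maxBins
    42)            // seed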

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Decision forests don't work with non-trivial categorical features

Posted by Manish Amde <ma...@gmail.com>.
Sean, sorry for missing out on the discussion.

Evan, you are correct: we are using the heuristic Sean suggested, added
during the multiclass PR, for ordering high-arity categorical variables
using the impurity value of each category of the feature.

Joseph, thanks for fixing the bug, which I think was a regression
introduced when we added support for RFs. I don't think we have seen this
in 1.1.

-Manish

On Mon, Oct 13, 2014 at 11:55 AM, Joseph Bradley <jo...@databricks.com>
wrote:

> I think this is the fix:
>
> In this
> file:
> mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala
>
> methods "getFeatureOffset" and "getLeftRightFeatureOffsets" have sanity
> checks ("require") which are correct for DecisionTree but not for
> RandomForest.  You can remove those.  I've sent a PR with this and a few
> other small fixes:
>
> https://github.com/apache/spark/pull/2785
>
> I hope this fixes the bug!
>
> On Mon, Oct 13, 2014 at 11:19 AM, Sean Owen <so...@cloudera.com> wrote:
>
> > Great, we'll confer then. I'm using master / 1.2.0-SNAPSHOT. I'll send
> > some details directly under separate cover.
> >
> > On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley <jo...@databricks.com>
> > wrote:
> > > Hi Sean,
> > >
> > > Sorry I didn't see this thread earlier!  (Thanks Ameet for pinging me.)
> > >
> > > Short version: That exception should not be thrown, so there is a bug
> > > somewhere.  The intended logic for handling high-arity categorical
> > features
> > > is about the best one can do, as far as I know.
> > >
> > > Bug finding: For my checking purposes, which branch of Spark are you
> > using,
> > > and do you have the options being submitted to DecisionTree?
> > >
> > > High-arity categorical features: As you have figured out, if you use a
> > > categorical feature with just a few categories, it is treated as
> > "unordered"
> > > so that we explicitly consider all exponentially many ways to split the
> > > categories into 2 groups.  If you use one with many categories, then it
> > is
> > > necessary to impose an order.  (The communication increases linearly in
> > the
> > > number of possible splits, so it would blow up if we considered all
> > > exponentially many splits.)  This order is chosen separately for each
> > node,
> > > so it is not a uniform order imposed over the entire tree.  This
> actually
> > > means that it is not a heuristic for regression and binary
> > classification;
> > > i.e., it chooses the same split as if we had explicitly considered all
> of
> > > the possible splits.  For multiclass classification, it is a heuristic,
> > but
> > > I don't know of a better solution.
> > >
> > > I'll check the code, but if you can forward info about the bug, that
> > would
> > > be very helpful.
> > >
> > > Thanks!
> > > Joseph
> > >
> >
>

Re: Decision forests don't work with non-trivial categorical features

Posted by Joseph Bradley <jo...@databricks.com>.
I think this is the fix:

In this file:
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala

the methods "getFeatureOffset" and "getLeftRightFeatureOffsets" have sanity
checks ("require") which are correct for DecisionTree but not for
RandomForest.  You can remove those.  I've sent a PR with this and a few
other small fixes:

https://github.com/apache/spark/pull/2785

I hope this fixes the bug!

On Mon, Oct 13, 2014 at 11:19 AM, Sean Owen <so...@cloudera.com> wrote:

> Great, we'll confer then. I'm using master / 1.2.0-SNAPSHOT. I'll send
> some details directly under separate cover.
>
> On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley <jo...@databricks.com>
> wrote:
> > Hi Sean,
> >
> > Sorry I didn't see this thread earlier!  (Thanks Ameet for pinging me.)
> >
> > Short version: That exception should not be thrown, so there is a bug
> > somewhere.  The intended logic for handling high-arity categorical
> features
> > is about the best one can do, as far as I know.
> >
> > Bug finding: For my checking purposes, which branch of Spark are you
> using,
> > and do you have the options being submitted to DecisionTree?
> >
> > High-arity categorical features: As you have figured out, if you use a
> > categorical feature with just a few categories, it is treated as
> "unordered"
> > so that we explicitly consider all exponentially many ways to split the
> > categories into 2 groups.  If you use one with many categories, then it
> is
> > necessary to impose an order.  (The communication increases linearly in
> the
> > number of possible splits, so it would blow up if we considered all
> > exponentially many splits.)  This order is chosen separately for each
> node,
> > so it is not a uniform order imposed over the entire tree.  This actually
> > means that it is not a heuristic for regression and binary
> classification;
> > i.e., it chooses the same split as if we had explicitly considered all of
> > the possible splits.  For multiclass classification, it is a heuristic,
> but
> > I don't know of a better solution.
> >
> > I'll check the code, but if you can forward info about the bug, that
> would
> > be very helpful.
> >
> > Thanks!
> > Joseph
> >
>

Re: Decision forests don't work with non-trivial categorical features

Posted by Sean Owen <so...@cloudera.com>.
Great, we'll confer then. I'm using master / 1.2.0-SNAPSHOT. I'll send
some details directly under separate cover.

On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley <jo...@databricks.com> wrote:
> Hi Sean,
>
> Sorry I didn't see this thread earlier!  (Thanks Ameet for pinging me.)
>
> Short version: That exception should not be thrown, so there is a bug
> somewhere.  The intended logic for handling high-arity categorical features
> is about the best one can do, as far as I know.
>
> Bug finding: For my checking purposes, which branch of Spark are you using,
> and do you have the options being submitted to DecisionTree?
>
> High-arity categorical features: As you have figured out, if you use a
> categorical feature with just a few categories, it is treated as "unordered"
> so that we explicitly consider all exponentially many ways to split the
> categories into 2 groups.  If you use one with many categories, then it is
> necessary to impose an order.  (The communication increases linearly in the
> number of possible splits, so it would blow up if we considered all
> exponentially many splits.)  This order is chosen separately for each node,
> so it is not a uniform order imposed over the entire tree.  This actually
> means that it is not a heuristic for regression and binary classification;
> i.e., it chooses the same split as if we had explicitly considered all of
> the possible splits.  For multiclass classification, it is a heuristic, but
> I don't know of a better solution.
>
> I'll check the code, but if you can forward info about the bug, that would
> be very helpful.
>
> Thanks!
> Joseph
>



Re: Decision forests don't work with non-trivial categorical features

Posted by Joseph Bradley <jo...@databricks.com>.
Hi Sean,

Sorry I didn't see this thread earlier!  (Thanks Ameet for pinging me.)

Short version: That exception should not be thrown, so there is a bug
somewhere.  The intended logic for handling high-arity categorical features
is about the best one can do, as far as I know.

Bug finding: So I can check, which branch of Spark are you using, and
could you send the options being submitted to DecisionTree?

High-arity categorical features: As you have figured out, if you use a
categorical feature with just a few categories, it is treated as
"unordered" so that we explicitly consider all exponentially many ways to
split the categories into 2 groups.  If you use one with many categories,
then it is necessary to impose an order.  (The communication increases
linearly in the number of possible splits, so it would blow up if we
considered all exponentially many splits.)  This order is chosen separately
for each node, so it is not a uniform order imposed over the entire tree.
This actually means that it is not a heuristic for regression and binary
classification; i.e., it chooses the same split as if we had explicitly
considered all of the possible splits.  For multiclass classification, it
is a heuristic, but I don't know of a better solution.
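
To make that concrete, here is a minimal sketch of the idea (this is not
the MLlib code, just an illustration of ordering categories by their mean
label and then considering only splits along that order; the helper names
are made up):

// Choose a binary split of one categorical feature by ordering its
// categories by mean label and scoring only "prefix" splits of that order.
// For regression and binary classification this recovers the same split as
// trying all 2^(k-1) - 1 subsets; for multiclass it is only a heuristic.
def bestSplit(points: Seq[(Int, Double)]): (Set[Int], Double) = {
  // points: (category, label) pairs for the data reaching this node;
  // assumes at least two distinct categories are present.
  val ordered = points.groupBy(_._1).toSeq
    .map { case (cat, pts) => (cat, pts.map(_._2)) }
    .sortBy { case (_, labels) => labels.sum / labels.size }  // mean label
    .map(_._1)

  def variance(xs: Seq[Double]): Double = {
    val m = xs.sum / xs.size
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // Score each prefix of the ordered categories as the "left" side.
  (1 until ordered.size).map { i =>
    val left = ordered.take(i).toSet
    val (l, r) = points.partition { case (c, _) => left(c) }
    val score = l.size * variance(l.map(_._2)) + r.size * variance(r.map(_._2))
    (left, score)
  }.minBy(_._2)
}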

I'll check the code, but if you can forward info about the bug, that would
be very helpful.

Thanks!
Joseph

On Mon, Oct 13, 2014 at 1:16 AM, Sean Owen <so...@cloudera.com> wrote:

> Hm, no I don't think I'm quite right there. There's an issue but
> that's not quite it.
>
> So I have a categorical feature with 40 value, and 300 bins. The error
> I see in the end is:
>
> java.lang.IllegalArgumentException: requirement failed:
> DTStatsAggregator.getLeftRightFeatureOffsets is for unordered features
> only, but was called for ordered feature 4.
> at scala.Predef$.require(Predef.scala:233)
> at
> org.apache.spark.mllib.tree.impl.DTStatsAggregator.getLeftRightFeatureOffsets(DTStatsAggregator.scala:143)
> at
> org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$mixedBinSeqOp(DecisionTree.scala:363)
> ...
>
> So this categorical is treated as an ordered feature because you would
> have to have a load of bins to match the condition in the code I
> quote. But something isn't expecting that. Is that much worth a JIRA?
>
> Hm, but what's the theory of giving meaning to the ordering of the
> arbitrary categorical values? is that what this is trying to do, or is
> it in fact ordering by some function of the target (average value for
> regression, average 1-vs-all entropy for classification)? I suppose I
> didn't expect to encounter logic like this.
>
> On Mon, Oct 13, 2014 at 9:04 AM, Sean Owen <so...@cloudera.com> wrote:
> > I'm looking at this bit of code in DecisionTreeMetadata ...
> >
> > val maxCategoriesForUnorderedFeature =
> >   ((math.log(maxPossibleBins / 2 + 1) / math.log(2.0)) + 1).floor.toInt
> > strategy.categoricalFeaturesInfo.foreach { case (featureIndex,
> numCategories) =>
> >   // Decide if some categorical features should be treated as
> > unordered features,
> >   //  which require 2 * ((1 << numCategories - 1) - 1) bins.
> >   // We do this check with log values to prevent overflows in case
> > numCategories is large.
> >   // The next check is equivalent to: 2 * ((1 << numCategories - 1) -
> > 1) <= maxBins
> >   if (numCategories <= maxCategoriesForUnorderedFeature) {
> >     unorderedFeatures.add(featureIndex)
> >     numBins(featureIndex) = numUnorderedBins(numCategories)
> >   } else {
> >     numBins(featureIndex) = numCategories
> >   }
> > }
> >
> > So if I have a feature with 40 values and less than about a trillion
> > bins, it gets treated as a continuous feature, which is meaningless.
> > It shortly throws an exception though since other parts of the code
> > expect this to be a categorical feature.
> >
> > I think there's a bug here somewhere but wasn't sure whether it was
> > just 'not implemented' yet and so needs a better error message (and
> > should be implemented), or something else preventing this from working
> > as expected.
> >
> > I'll wait a beat to get more info and then if needed open a JIRA. Thanks
> all.
> >
> > On Mon, Oct 13, 2014 at 3:34 AM, Evan Sparks <ev...@gmail.com>
> wrote:
> >> I was under the impression that we were using the usual sort by average
> response value heuristic when storing histogram bins (and searching for
> optimal splits) in the tree code.
> >>
> >> Maybe Manish or Joseph can clarify?
> >>
> >>> On Oct 12, 2014, at 2:50 PM, Sean Owen <so...@cloudera.com> wrote:
> >>>
> >>> I'm having trouble getting decision forests to work with categorical
> >>> features. I have a dataset with a categorical feature with 40 values.
> >>> It seems to be treated as a continuous/numeric value by the
> >>> implementation.
> >>>
> >>> Digging deeper, I see there is some logic in the code that indicates
> >>> that categorical features over N values do not work unless the number
> >>> of bins is at least 2*((2^N - 1) - 1) bins. I understand this as the
> >>> naive brute force condition, wherein the decision tree will test all
> >>> possible splits of the categorical value.
> >>>
> >>> However, this gets unusable quickly as the number of bins should be
> >>> tens or hundreds at best, and this requirement rules out categorical
> >>> values over more than 10 or so features as a result. But, of course,
> >>> it's not unusual to have categorical features with high cardinality.
> >>> It's almost common.
> >>>
> >>> There are some pretty fine heuristics for selecting 'bins' over
> >>> categorical features when the number of bins is far fewer than the
> >>> complete, exhaustive set.
> >>>
> >>> Before I open a JIRA or continue, does anyone know what I am talking
> >>> about, am I mistaken? Is this a real limitation and is it worth
> >>> pursuing these heuristics? I can't figure out how to proceed with
> >>> decision forests in MLlib otherwise.
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: Decision forests don't work with non-trivial categorical features

Posted by Sean Owen <so...@cloudera.com>.
Hm, no I don't think I'm quite right there. There's an issue but
that's not quite it.

So I have a categorical feature with 40 values, and 300 bins. The error
I see in the end is:

java.lang.IllegalArgumentException: requirement failed:
DTStatsAggregator.getLeftRightFeatureOffsets is for unordered features
only, but was called for ordered feature 4.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.mllib.tree.impl.DTStatsAggregator.getLeftRightFeatureOffsets(DTStatsAggregator.scala:143)
at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$mixedBinSeqOp(DecisionTree.scala:363)
...

So this categorical feature is treated as an ordered feature, because you
would need a huge number of bins to satisfy the condition in the code I
quoted. But something isn't expecting that. Is that much worth a JIRA?

Hm, but what's the theory behind giving meaning to the ordering of
arbitrary categorical values? Is that what this is trying to do, or is it
in fact ordering by some function of the target (average value for
regression, average one-vs-all entropy for classification)? I suppose I
didn't expect to encounter logic like this.

On Mon, Oct 13, 2014 at 9:04 AM, Sean Owen <so...@cloudera.com> wrote:
> I'm looking at this bit of code in DecisionTreeMetadata ...
>
> val maxCategoriesForUnorderedFeature =
>   ((math.log(maxPossibleBins / 2 + 1) / math.log(2.0)) + 1).floor.toInt
> strategy.categoricalFeaturesInfo.foreach { case (featureIndex, numCategories) =>
>   // Decide if some categorical features should be treated as
> unordered features,
>   //  which require 2 * ((1 << numCategories - 1) - 1) bins.
>   // We do this check with log values to prevent overflows in case
> numCategories is large.
>   // The next check is equivalent to: 2 * ((1 << numCategories - 1) -
> 1) <= maxBins
>   if (numCategories <= maxCategoriesForUnorderedFeature) {
>     unorderedFeatures.add(featureIndex)
>     numBins(featureIndex) = numUnorderedBins(numCategories)
>   } else {
>     numBins(featureIndex) = numCategories
>   }
> }
>
> So if I have a feature with 40 values and less than about a trillion
> bins, it gets treated as a continuous feature, which is meaningless.
> It shortly throws an exception though since other parts of the code
> expect this to be a categorical feature.
>
> I think there's a bug here somewhere but wasn't sure whether it was
> just 'not implemented' yet and so needs a better error message (and
> should be implemented), or something else preventing this from working
> as expected.
>
> I'll wait a beat to get more info and then if needed open a JIRA. Thanks all.
>
> On Mon, Oct 13, 2014 at 3:34 AM, Evan Sparks <ev...@gmail.com> wrote:
>> I was under the impression that we were using the usual sort by average response value heuristic when storing histogram bins (and searching for optimal splits) in the tree code.
>>
>> Maybe Manish or Joseph can clarify?
>>
>>> On Oct 12, 2014, at 2:50 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> I'm having trouble getting decision forests to work with categorical
>>> features. I have a dataset with a categorical feature with 40 values.
>>> It seems to be treated as a continuous/numeric value by the
>>> implementation.
>>>
>>> Digging deeper, I see there is some logic in the code that indicates
>>> that categorical features over N values do not work unless the number
>>> of bins is at least 2*((2^N - 1) - 1) bins. I understand this as the
>>> naive brute force condition, wherein the decision tree will test all
>>> possible splits of the categorical value.
>>>
>>> However, this gets unusable quickly as the number of bins should be
>>> tens or hundreds at best, and this requirement rules out categorical
>>> values over more than 10 or so features as a result. But, of course,
>>> it's not unusual to have categorical features with high cardinality.
>>> It's almost common.
>>>
>>> There are some pretty fine heuristics for selecting 'bins' over
>>> categorical features when the number of bins is far fewer than the
>>> complete, exhaustive set.
>>>
>>> Before I open a JIRA or continue, does anyone know what I am talking
>>> about, am I mistaken? Is this a real limitation and is it worth
>>> pursuing these heuristics? I can't figure out how to proceed with
>>> decision forests in MLlib otherwise.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>



Re: Decision forests don't work with non-trivial categorical features

Posted by Sean Owen <so...@cloudera.com>.
I'm looking at this bit of code in DecisionTreeMetadata ...

val maxCategoriesForUnorderedFeature =
  ((math.log(maxPossibleBins / 2 + 1) / math.log(2.0)) + 1).floor.toInt
strategy.categoricalFeaturesInfo.foreach { case (featureIndex, numCategories) =>
  // Decide if some categorical features should be treated as unordered features,
  //  which require 2 * ((1 << numCategories - 1) - 1) bins.
  // We do this check with log values to prevent overflows in case numCategories is large.
  // The next check is equivalent to: 2 * ((1 << numCategories - 1) - 1) <= maxBins
  if (numCategories <= maxCategoriesForUnorderedFeature) {
    unorderedFeatures.add(featureIndex)
    numBins(featureIndex) = numUnorderedBins(numCategories)
  } else {
    numBins(featureIndex) = numCategories
  }
}

So if I have a feature with 40 values and less than about a trillion
bins, it gets treated as a continuous feature, which is meaningless.
It shortly throws an exception though since other parts of the code
expect this to be a categorical feature.
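
Plugging in my numbers as a sanity check (these are just the expressions
above evaluated in the REPL; 300 is roughly the maxBins I am passing):

val maxPossibleBins = 300
val maxCategoriesForUnorderedFeature =
  ((math.log(maxPossibleBins / 2 + 1) / math.log(2.0)) + 1).floor.toInt
// = floor(log2(151)) + 1 = 7 + 1 = 8, so features with more than 8
// categories take the else branch above.

// Bins needed for a 40-category feature to be treated as unordered:
val binsNeededFor40 = BigInt(2) * (BigInt(2).pow(39) - 1)
// = 1099511627774, i.e. about 1.1 trillion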

I think there's a bug here somewhere, but I wasn't sure whether this is
just 'not implemented' yet and so needs a better error message (and should
be implemented), or whether something else is preventing it from working
as expected.

I'll wait a beat to get more info and then if needed open a JIRA. Thanks all.

On Mon, Oct 13, 2014 at 3:34 AM, Evan Sparks <ev...@gmail.com> wrote:
> I was under the impression that we were using the usual sort by average response value heuristic when storing histogram bins (and searching for optimal splits) in the tree code.
>
> Maybe Manish or Joseph can clarify?
>
>> On Oct 12, 2014, at 2:50 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I'm having trouble getting decision forests to work with categorical
>> features. I have a dataset with a categorical feature with 40 values.
>> It seems to be treated as a continuous/numeric value by the
>> implementation.
>>
>> Digging deeper, I see there is some logic in the code that indicates
>> that categorical features over N values do not work unless the number
>> of bins is at least 2*((2^N - 1) - 1) bins. I understand this as the
>> naive brute force condition, wherein the decision tree will test all
>> possible splits of the categorical value.
>>
>> However, this gets unusable quickly as the number of bins should be
>> tens or hundreds at best, and this requirement rules out categorical
>> values over more than 10 or so features as a result. But, of course,
>> it's not unusual to have categorical features with high cardinality.
>> It's almost common.
>>
>> There are some pretty fine heuristics for selecting 'bins' over
>> categorical features when the number of bins is far fewer than the
>> complete, exhaustive set.
>>
>> Before I open a JIRA or continue, does anyone know what I am talking
>> about, am I mistaken? Is this a real limitation and is it worth
>> pursuing these heuristics? I can't figure out how to proceed with
>> decision forests in MLlib otherwise.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>



Re: Decision forests don't work with non-trivial categorical features

Posted by Evan Sparks <ev...@gmail.com>.
I was under the impression that we were using the usual sort-by-average-response-value heuristic when storing histogram bins (and searching for optimal splits) in the tree code.

Maybe Manish or Joseph can clarify?

> On Oct 12, 2014, at 2:50 PM, Sean Owen <so...@cloudera.com> wrote:
> 
> I'm having trouble getting decision forests to work with categorical
> features. I have a dataset with a categorical feature with 40 values.
> It seems to be treated as a continuous/numeric value by the
> implementation.
> 
> Digging deeper, I see there is some logic in the code that indicates
> that categorical features over N values do not work unless the number
> of bins is at least 2*((2^N - 1) - 1) bins. I understand this as the
> naive brute force condition, wherein the decision tree will test all
> possible splits of the categorical value.
> 
> However, this gets unusable quickly as the number of bins should be
> tens or hundreds at best, and this requirement rules out categorical
> values over more than 10 or so features as a result. But, of course,
> it's not unusual to have categorical features with high cardinality.
> It's almost common.
> 
> There are some pretty fine heuristics for selecting 'bins' over
> categorical features when the number of bins is far fewer than the
> complete, exhaustive set.
> 
> Before I open a JIRA or continue, does anyone know what I am talking
> about, am I mistaken? Is this a real limitation and is it worth
> pursuing these heuristics? I can't figure out how to proceed with
> decision forests in MLlib otherwise.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
> 
