
Posted to user@spark.apache.org by Aseem Bansal <as...@gmail.com> on 2017/03/23 14:34:10 UTC

Does spark's random forest need categorical features to be one hot encoded?

I was reading
http://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest
and found that one-hot encoding needs to be done in sklearn. Is it also required in Spark?

Re: Does spark's random forest need categorical features to be one hot encoded?

Posted by Ryan <ry...@gmail.com>.

No, you don't need one-hot encoding. But since the features column is a
vector, and a vector only accepts numbers, a StringIndexer is needed if
your feature is a string.

Here's an example from the docs:
http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
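
To make the difference between the two encodings concrete, here is a minimal plain-Python sketch. It does not use the Spark API; the function names (fit_string_index, index_column, one_hot_column) and the sample data are hypothetical, made up for illustration. It assumes the indexer orders labels by descending frequency, which is my understanding of StringIndexer's default, with ties broken alphabetically.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a numeric index, most frequent first.

    Assumption: this mirrors a frequency-descending ordering, with ties
    broken alphabetically. It is an illustration, not Spark code.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}


def index_column(values, mapping):
    """Replace each string with its index: one numeric column."""
    return [mapping[v] for v in values]


def one_hot_column(values, mapping):
    """Expand each string into a 0/1 vector: one column per category."""
    width = len(mapping)
    rows = []
    for v in values:
        row = [0.0] * width
        row[int(mapping[v])] = 1.0
        rows.append(row)
    return rows


# Hypothetical categorical feature.
colors = ["red", "green", "red", "blue", "red", "green"]

mapping = fit_string_index(colors)
indexed = index_column(colors, mapping)
one_hot = one_hot_column(colors, mapping)

print(mapping)
print(indexed)
print(one_hot[3])
```

The indexed form is a single numeric column, which is what the reply above says is enough for the random forest. The one-hot form widens the feature vector by one column per category, and is the step the reply says can be skipped.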

On Thu, Mar 23, 2017 at 10:34 PM, Aseem Bansal <as...@gmail.com> wrote:

> I was reading
> http://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest
> and found that needs to be done in sklearn. Is that required in spark?
>