Posted to user@spark.apache.org by Marco Mistroni <mm...@gmail.com> on 2016/09/14 21:18:59 UTC

Please assist: migrating RandomForestExample from MLLib to ML

Hi all,
I have been toying around with this well-known RandomForestExample code:

val forest = RandomForest.trainClassifier(
  trainData, 7, Map(10 -> 4, 11 -> 40), 20,
  "auto", "entropy", 30, 300)
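
For readers unfamiliar with the positional arguments, the same call can be spelled out with named parameters. This is only a sketch: the wrapper name trainForest is hypothetical, and the parameter names are assumed to be those of the Scala overload of RandomForest.trainClassifier in spark.mllib.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Hypothetical wrapper, for illustration only: the call from the post
// with each positional argument given its parameter name.
def trainForest(trainData: RDD[LabeledPoint]): RandomForestModel =
  RandomForest.trainClassifier(
    input = trainData,
    numClasses = 7,
    // 0-based feature index -> number of distinct category values
    categoricalFeaturesInfo = Map(10 -> 4, 11 -> 40),
    numTrees = 20,
    featureSubsetStrategy = "auto",
    impurity = "entropy",
    maxDepth = 30,
    maxBins = 300)
```

The Map(10 -> 4, 11 -> 40) argument is what declares features 10 and 11 as categorical, with 4 and 40 values respectively. This is the information that has to be carried over to ML in some other form.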

This comes from this link
(https://www.safaribooksonline.com/library/view/advanced-analytics-with/9781491912751/ch04.html)
and also from Sean Owen's presentation
(https://www.youtube.com/watch?v=ObiCMJ24ezs).

Now I want to migrate it to use the ML library.
The problem is that the MLlib example has categorical features, and I
cannot find a way to declare categorical features with ML.
Apparently I should use VectorIndexer, but VectorIndexer takes only one
input column for features.
At the moment I am using VectorAssembler instead, but I cannot find a way
to achieve the same result with it.
I have checked the Spark samples, but all I can see is
RandomForestClassifier using VectorIndexer with a single input column.



Could anyone assist?
This is my current code. What do I need to add to take categorical
features into account?

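import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}

// `data` is the input DataFrame; "Col0" holds the label.
val labelIndexer = new StringIndexer()
  .setInputCol("Col0")
  .setOutputCol("indexedLabel")
  .fit(data)

// Assemble the input columns into a single feature vector.
val features = new VectorAssembler()
  .setInputCols(Array(
    "Col1", "Col2", "Col3", "Col4", "Col5",
    "Col6", "Col7", "Col8", "Col9", "Col10"))
  .setOutputCol("features")

// Map predicted indices back to the original label strings.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(20)
  .setMaxDepth(30)
  .setMaxBins(300)
  .setImpurity("entropy")

println("Kicking off pipeline..")

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, features, rf, labelConverter))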

Thanks in advance and regards,
Marco

Re: Please assist: migrating RandomForestExample from MLLib to ML

Posted by Marco Mistroni <mm...@gmail.com>.

Many thanks, Sean!
Kind regards,
Marco


Re: Please assist: migrating RandomForestExample from MLLib to ML

Posted by Sean Owen <so...@cloudera.com>.

If it helps, I've already updated that code for the 2nd edition, which
will be based on ~Spark 2.1:

https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/scala/com/cloudera/datascience/rdf/RunRDF.scala#L220

This should be an equivalent working example that deals with
categoricals via VectorIndexer.
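
As a rough illustration of that approach (this is not the code at the link above), the sketch below feeds the VectorAssembler output into a VectorIndexer, so several input columns end up in one vector whose low-cardinality features are marked categorical. The names buildPipeline, featureCols and "rawFeatures" are hypothetical, the threshold of 40 is an assumption chosen to match the largest category count in the original MLlib call, and the Spark 2.x ML API is assumed.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: `data` holds the label in "Col0" and the numeric
// feature columns named in `featureCols` (both are assumptions).
def buildPipeline(data: DataFrame, featureCols: Array[String]): Pipeline = {
  val labelIndexer = new StringIndexer()
    .setInputCol("Col0")
    .setOutputCol("indexedLabel")
    .fit(data)

  // Step 1: combine all feature columns into one vector column.
  val assembler = new VectorAssembler()
    .setInputCols(featureCols)
    .setOutputCol("rawFeatures")

  // Step 2: features with at most 40 distinct values are indexed and
  // tagged as categorical in the output column's metadata.
  val featureIndexer = new VectorIndexer()
    .setInputCol("rawFeatures")
    .setOutputCol("features")
    .setMaxCategories(40)

  val rf = new RandomForestClassifier()
    .setLabelCol("indexedLabel")
    .setFeaturesCol("features")
    .setNumTrees(20)
    .setMaxDepth(30)
    .setMaxBins(300)
    .setImpurity("entropy")

  // Map predicted indices back to the original label strings.
  val labelConverter = new IndexToString()
    .setInputCol("prediction")
    .setOutputCol("predictedLabel")
    .setLabels(labelIndexer.labels)

  new Pipeline().setStages(
    Array(labelIndexer, assembler, featureIndexer, rf, labelConverter))
}
```

Two caveats with this sketch: VectorIndexer decides purely by the number of distinct values, so a numeric column with 40 or fewer distinct values would also be treated as categorical, and maxBins has to be at least as large as the largest number of categories.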

You're right that you must use it because it adds the metadata that
says it's categorical. I'm not sure of another way to do it?
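
One way to check what the indexer actually decided is to fit it and look at its category maps. The sketch below is an assumption-laden illustration: the helper name categoricalFeatureIndices and the "rawFeatures" column name are hypothetical, and the input columns are assumed to be numeric.

```scala
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: returns the indices (within the assembled vector)
// of the features that VectorIndexer chose to treat as categorical.
def categoricalFeatureIndices(
    data: DataFrame,
    featureCols: Array[String],
    maxCategories: Int): Seq[Int] = {
  val assembled = new VectorAssembler()
    .setInputCols(featureCols)
    .setOutputCol("rawFeatures")
    .transform(data)

  val indexerModel = new VectorIndexer()
    .setInputCol("rawFeatures")
    .setOutputCol("features")
    .setMaxCategories(maxCategories)
    .fit(assembled)

  // categoryMaps: feature index -> (original value -> category index)
  indexerModel.categoryMaps.keys.toSeq.sorted
}
```

If the returned indices do not match the features that are meant to be categorical, the maxCategories threshold is the first thing to revisit.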

Sean


