Posted to user@spark.apache.org by OBones <ob...@free.fr> on 2017/06/15 09:59:55 UTC

[How-To] Migrating from mllib.tree.DecisionTree to ml.regression.DecisionTreeRegressor

Hello,

I have written the following Scala code to train a regression tree, based on mllib:

     val conf = new SparkConf().setAppName("DecisionTreeRegressionExample")
     val sc = new SparkContext(conf)
     val spark = SparkSession.builder().getOrCreate()

     val sourceData = spark.read.format("com.databricks.spark.csv")
       .option("header", "true")
       .option("delimiter", ";")
       .load("C:\\Data\\source_file.csv")

     val data = sourceData.select($"X3".cast("double"), $"Y".cast("double"),
       $"X1".cast("double"), $"X2".cast("double"))

     val featureIndices = List("X1", "X2", "X3").map(data.columns.indexOf(_))
     val targetIndex = data.columns.indexOf("Y")

     // WARNING: Indices in categoricalFeaturesInfo are those inside the
     // vector we build from the featureIndices list.
     // Column 0 has two modalities, Column 1 has three.
     val categoricalFeaturesInfo = Map[Int, Int]((0, 2), (1, 3))
     val impurity = "variance"
     val maxDepth = 30
     val maxBins = 32

     val labeled = data.map(row => LabeledPoint(row.getDouble(targetIndex),
       Vectors.dense(featureIndices.map(row.getDouble(_)).toArray)))

     val model = DecisionTree.trainRegressor(labeled.rdd,
       categoricalFeaturesInfo, impurity, maxDepth, maxBins)

     println(model.toDebugString)

This works quite well, but I want some additional information from the model,
one piece being the feature importance values. As it turns out, these are not
available on DecisionTreeModel but are available on DecisionTreeRegressionModel
from the ml package.
I then discovered that the ml package is more recent than the mllib package,
which explains why it gives me more control over the trees I'm building.
So I tried to rewrite my sample code using the ml package, and it is much
easier to use; there is no need for the LabeledPoint transformation. Here is
the code I came up with:

     val dt = new DecisionTreeRegressor()
       .setPredictionCol("Y")
       .setImpurity("variance")
       .setMaxDepth(30)
       .setMaxBins(32)

     val model = dt.fit(data)

     println(model.toDebugString)
     println(model.featureImportances.toString)

However, I cannot find a way to specify which columns are features, which
ones are categorical, and how many categories they have, as I used to do with
the mllib package.
I did look at the DecisionTreeRegressionExample.scala example found in the
source package, but it uses a VectorIndexer to automatically discover the
above information, which is an unnecessary step in my case because I already
have the information at hand.

The online documentation
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor)
did not help either, because it does not indicate the expected format for the
featuresCol string property.

Thanks in advance for your help.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: [How-To] Migrating from mllib.tree.DecisionTree to ml.regression.DecisionTreeRegressor

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
Hi, OBones.

1. Which columns are features?
For ml, use `setFeaturesCol` and `setLabelCol` to assign the input columns:
https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier

2. Which ones are categorical?
For ml, use a Transformer to create the features Vector.
In your case, use VectorIndexer:
http://spark.apache.org/docs/latest/ml-features.html#vectorindexer

In short: use a Transformer / Estimator to create the features Vector, then
use an Estimator to train and test.
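A minimal sketch of how these pieces fit together, using the column names from your post. The toy DataFrame and the local-mode session are assumptions for illustration only; substitute your own CSV load:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.sql.SparkSession

object DecisionTreeRegressorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DecisionTreeRegressorSketch")
      .master("local[*]") // assumption: local mode for illustration
      .getOrCreate()
    import spark.implicits._

    // Toy data standing in for the CSV from the original post
    val data = Seq(
      (0.0, 0.0, 1.5, 10.0),
      (1.0, 2.0, 2.5, 20.0),
      (0.0, 1.0, 3.5, 30.0)
    ).toDF("X1", "X2", "X3", "Y")

    // 1. Assemble the individual feature columns into a single Vector column
    val assembler = new VectorAssembler()
      .setInputCols(Array("X1", "X2", "X3"))
      .setOutputCol("rawFeatures")

    // 2. Let VectorIndexer flag low-cardinality slots as categorical
    val indexer = new VectorIndexer()
      .setInputCol("rawFeatures")
      .setOutputCol("features")
      .setMaxCategories(3)

    // 3. Point the estimator at the feature and label columns explicitly
    val dt = new DecisionTreeRegressor()
      .setFeaturesCol("features")
      .setLabelCol("Y")

    val model = new Pipeline()
      .setStages(Array(assembler, indexer, dt))
      .fit(data)

    spark.stop()
  }
}
```

featuresCol is just the name of a column whose type is Vector; it is the assembler/indexer stages that actually build that column.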






Re: [How-To] Migrating from mllib.tree.DecisionTree to ml.regression.DecisionTreeRegressor

Posted by OBones <ob...@free.fr>.
OBones wrote:
> However, I cannot find a way to specify which columns are features,
> which ones are categorical and how many categories they have, like I
> used to do with the mllib package.
Well, further research led me to add the following code to indicate which
columns are categorical:

     val X1Attribute = NominalAttribute.defaultAttr.withName("X1")
       .withValues("0", "1").toMetadata
     val X2Attribute = NominalAttribute.defaultAttr.withName("X2")
       .withValues("0", "1", "2").toMetadata

     val dataWithAttributes = data
       .withColumn("X1", $"X1".as("X1", X1Attribute))
       .withColumn("X2", $"X2".as("X2", X2Attribute))

but when I run this:

     val model = dt.fit(dataWithAttributes)

I get the following error:

     java.lang.IllegalArgumentException: Field "features" does not exist.

It makes sense, because I have yet to find a way to specify which columns are
the features.
I also have to figure out what the label column is and how it differs from the
prediction column, as only the latter was used with the mllib package.
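For reference, the error means the estimator is looking for its featuresCol (named "features" by default) and no such column exists yet. One way to create it, sketched below under the assumption that dataWithAttributes and dt are defined as above, is a VectorAssembler, which also propagates the per-column NominalAttribute metadata into the resulting Vector:

```scala
import org.apache.spark.ml.feature.VectorAssembler

// Assumption: dataWithAttributes carries the NominalAttribute metadata set above.
// VectorAssembler copies each input column's metadata into the output Vector's
// slots, so the tree can still see which positions are categorical.
val assembler = new VectorAssembler()
  .setInputCols(Array("X1", "X2", "X3"))
  .setOutputCol("features") // matches DecisionTreeRegressor's default featuresCol

val assembled = assembler.transform(dataWithAttributes)

// labelCol is the column to learn from; predictionCol is the column the model
// writes its output to. setPredictionCol("Y") above was therefore naming the
// output, not the target.
val model = dt
  .setLabelCol("Y")
  .setPredictionCol("prediction")
  .fit(assembled)
```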

