Posted to user@spark.apache.org by Marco Mistroni <mm...@gmail.com> on 2017/12/15 22:26:18 UTC

Please Help with DecisionTree/FeatureIndexer

Hi all,
I am trying to run a sample decision tree, following the examples here (for
MLlib):

https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier

The example seems to use a VectorIndexer, but I am missing something.
How does the featureIndexer know which columns are features?
Isn't there something missing? Or is the featureIndexer able to figure
out by itself
which columns of the DataFrame are features?

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
  .fit(data)

Using this code, I get the following exception:

Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
        at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
        at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
        at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
        at scala.collection.AbstractMap.getOrElse(Map.scala:59)
        at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
        at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
        at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
        at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
        at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)

What am I missing?

Kindest regards,

 Marco

Re: Please Help with DecisionTree/FeatureIndexer

Posted by Weichen Xu <we...@databricks.com>.
Hi, Marco

You do not need to call fit/transform on each stage yourself. You only need to
call `pipeline.fit` and `pipelineModel.transform`, like the following. The one
exception is the label indexer, which is fitted up front because `IndexToString`
needs its `labels`:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}

    // inputData is the original DataFrame; "Severity" is the label column.
    val assembler = new VectorAssembler()
      .setInputCols(inputData.columns.filter(_ != "Severity"))
      .setOutputCol("features")

    // A fitted StringIndexerModel is still a valid pipeline stage.
    val labelIndexer = new StringIndexer()
      .setInputCol("Severity")
      .setOutputCol("indexedLabel")
      .fit(inputData)

    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(5) // features with > 5 distinct values are treated as continuous.

    // Split the raw data: the assembler runs inside the pipeline, so the input
    // must not already contain a "features" column.
    val Array(trainingData, testData) = inputData.randomSplit(Array(0.8, 0.2))

    // Train a DecisionTree model.
    val dt = new DecisionTreeClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")

    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Chain assembler, indexers and tree in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))

    trainingData.cache()
    testData.cache()

    // Train model. This also runs the assembler and the feature indexer.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)


Thanks.
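
To make the `setMaxCategories` comment concrete, here is a minimal plain-Scala sketch of the rule as the VectorIndexer documentation describes it: a feature whose number of distinct values is at most `maxCategories` is treated as categorical, anything above that as continuous. The helper name `categoricalFeatureIndices` and the sample rows are hypothetical, and this is only an illustration of the rule, not Spark's implementation.

```scala
// Hypothetical helper illustrating the maxCategories rule; not Spark's own code.
def categoricalFeatureIndices(rows: Seq[Array[Double]], maxCategories: Int): Set[Int] = {
  require(rows.nonEmpty, "need at least one row")
  val numFeatures = rows.head.length
  (0 until numFeatures).filter { i =>
    // Count the distinct values observed for feature i.
    rows.map(_(i)).distinct.size <= maxCategories
  }.toSet
}

// Made-up sample: feature 0 has 2 distinct values, feature 1 has 6.
val sampleRows = Seq(
  Array(0.0, 21.0), Array(1.0, 35.0), Array(0.0, 47.0),
  Array(1.0, 52.0), Array(0.0, 63.0), Array(1.0, 78.0)
)

// With maxCategories = 5, only feature 0 qualifies as categorical.
val categorical = categoricalFeatureIndices(sampleRows, maxCategories = 5)
println(categorical)
```

Raising `maxCategories` to 6 in this sketch would make feature 1 categorical as well, which is why the comment next to `setMaxCategories` should always quote the same number as the argument.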

On Wed, Dec 20, 2017 at 5:26 AM, Marco Mistroni <mm...@gmail.com> wrote:

> Hello Weichen,
> I will try it out and let you know.
> But if I add the assembler to the pipeline, do I still have to call
> assembler.transform and XXXIndexer.fit()?
> Kind regards,
>  Marco

Re: Please Help with DecisionTree/FeatureIndexer

Posted by Weichen Xu <we...@databricks.com>.
Hi Marco,

If you add the assembler as the first stage of the pipeline, like:
```
 val pipeline = new Pipeline()
      .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))
```

which error do you get?

I think it should work fine if the `assembler` is added to the pipeline.

Thanks.
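
As a rough illustration of the assembler-first layout, the sketch below fits a whole pipeline on raw columns without calling `transform` or `fit` on any single stage. The rows, the simplified column names and the app name are made up for the example; it assumes a local Spark 2.x or later with spark-mllib on the classpath, and it fits and predicts on the same tiny DataFrame only to keep the example short.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("assembler-first-sketch").getOrCreate()
import spark.implicits._

// Made-up rows; the column names loosely mirror the dataset described in the thread.
val raw = Seq(
  (5.0, 67.0, 3.0, 5.0, 3.0, 1.0),
  (4.0, 43.0, 1.0, 1.0, 2.0, 0.0),
  (5.0, 58.0, 4.0, 5.0, 3.0, 1.0),
  (4.0, 28.0, 1.0, 1.0, 2.0, 0.0),
  (5.0, 74.0, 1.0, 5.0, 3.0, 1.0),
  (4.0, 65.0, 1.0, 1.0, 2.0, 0.0)
).toDF("birads", "age", "shape", "margin", "density", "Severity")

// Stage 1: build the "features" vector from every column except the label.
val assembler = new VectorAssembler()
  .setInputCols(raw.columns.filter(_ != "Severity"))
  .setOutputCol("features")

// Stages 2 and 3 are left unfitted; the pipeline fits them in order.
val labelIndexer = new StringIndexer()
  .setInputCol("Severity")
  .setOutputCol("indexedLabel")

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5)

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")

val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, dt))

// Fit on the raw DataFrame. It must not already contain a "features" column,
// because VectorAssembler refuses to overwrite an existing output column.
val model = pipeline.fit(raw)
val predictions = model.transform(raw)
predictions.select("Severity", "indexedLabel", "prediction").show()
```

The key point of the design is that every stage only sees the output of the stages before it, so the DataFrame passed to `pipeline.fit` should be the raw one rather than one that has already been through the assembler.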

On Tue, Dec 19, 2017 at 6:08 AM, Marco Mistroni <mm...@gmail.com> wrote:

> Hello Weichen,
> Sorry to bother you again with my ML issue, but I feel you have more
> experience than I do in this, and perhaps you can tell me whether I am
> following the correct steps, as I seem to get confused by the different
> examples on decision trees.
>
> As a starting point, I have this DataFrame:
>
> [BI-RADS, Age, Shape, Margin, Density, Severity]
>
> The label is 'Severity' and all the others are features.
> Could you advise whether I am doing the correct thing? I am unable to add
> the assembler at the beginning of the pipeline, so I resorted to the
> following code instead
> (inputData is the original DataFrame):
>
>     val assembler = new VectorAssembler().
>       setInputCols(inputData.columns.filter(_ != "Severity")).
>       setOutputCol("features")
>
>     val data = assembler.transform(inputData)
>
>     val labelIndexer = new StringIndexer()
>       .setInputCol("Severity")
>       .setOutputCol("indexedLabel")
>       .fit(data)
>
>     val featureIndexer =
>       new VectorIndexer()
>       .setInputCol("features")
>       .setOutputCol("indexedFeatures")
>       .setMaxCategories(5) // features with > 5 distinct values are treated as continuous.
>       .fit(data)
>
>     val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
>     // Train a DecisionTree model.
>     val dt = new DecisionTreeClassifier()
>       .setLabelCol("indexedLabel")
>       .setFeaturesCol("indexedFeatures")
>
>     // Convert indexed labels back to original labels.
>       val labelConverter = new IndexToString()
>       .setInputCol("prediction")
>       .setOutputCol("predictedLabel")
>       .setLabels(labelIndexer.labels)
>
>     // Chain indexers and tree in a Pipeline.
>     val pipeline = new Pipeline()
>       .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
>
>     trainingData.cache()
>     testData.cache()
>
>
>     // Train model. This also runs the indexers.
>     val model = pipeline.fit(trainingData)
>
>     // Make predictions.
>     val predictions = model.transform(testData)
>
>     // Select example rows to display.
>     predictions.select("predictedLabel", "indexedLabel", "indexedFeatures").show(5)
>
>     // Select (prediction, true label) and compute test error.
>     val evaluator = new MulticlassClassificationEvaluator()
>       .setLabelCol("indexedLabel")
>       .setPredictionCol("prediction")
>       .setMetricName("accuracy")
>     val accuracy = evaluator.evaluate(predictions)
>     println("Test Error = " + (1.0 - accuracy))
>
> Could you advise whether this is the proper approach when using an
> assembler?
> I was unable to add the assembler at the beginning of the pipeline. It
> seems it did not get invoked because, at the moment of calling the
> featureIndexer, the column 'features' was not found.
>
> This is not urgent; I would appreciate it if you could give me your comments.
> Kind regards,
>  Marco

Re: Please Help with DecisionTree/FeatureIndexer

Posted by Weichen Xu <we...@databricks.com>.
Hi Marco,

Yes, you can apply `VectorAssembler` first in the pipeline to assemble
multiple feature columns into a single vector column.

Thanks.
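
A small sketch of what the assembler does on its own, assuming a local Spark session with spark-mllib available. The two rows and the app name are invented for illustration; the column layout follows the |col1|col2|col3|label| format mentioned in the thread.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("assembler-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data in the |col1|col2|col3|label| layout.
val inputData = Seq(
  (1.0, 0.0, 3.5, "yes"),
  (0.0, 1.0, 2.0, "no")
).toDF("col1", "col2", "col3", "label")

// Combine every non-label column into one vector column named "features".
val assembler = new VectorAssembler()
  .setInputCols(inputData.columns.filter(_ != "label"))
  .setOutputCol("features")

val assembled = assembler.transform(inputData)
assembled.printSchema()
assembled.select("features", "label").show(truncate = false)
```

The original columns are kept and a new vector column is appended, which is the column that `VectorIndexer.setInputCol("features")` then refers to.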


Re: Please Help with DecisionTree/FeatureIndexer

Posted by Marco Mistroni <mm...@gmail.com>.
Hello Wei,
Thanks, I should have checked the data.
My data has this format:
|col1|col2|col3|label|

So it looks like I cannot use VectorIndexer directly (it accepts a Vector
column).
I am guessing that what I should do is something like this (given that I have
a few categorical features):

    val assembler = new VectorAssembler()
      .setInputCols(inputData.columns.filter(_ != "label"))
      .setOutputCol("features")

    val transformedData = assembler.transform(inputData)

    val featureIndexer =
      new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(5) // features with > 5 distinct values are treated as continuous.
      .fit(transformedData)

?
Apologies for the basic question, but the last time I worked on an ML project I
was using Spark 1.x.

Kind regards,
 marco

Re: Please Help with DecisionTree/FeatureIndexer

Posted by Weichen Xu <we...@databricks.com>.
Hi, Marco,

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

The data now includes a feature column named "features":

val featureIndexer = new VectorIndexer()
  .setInputCol("features")   // <------ this specifies the "features" column to index
  .setOutputCol("indexedFeatures")


Thanks.
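
One way to see this for yourself is to print the schema of a DataFrame loaded through the `libsvm` source. The sketch below writes a tiny made-up LIBSVM file to a temporary location so it does not depend on the sample file shipped with Spark; the file contents and the app name are assumptions for illustration only.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("libsvm-schema-sketch").getOrCreate()

// A tiny made-up LIBSVM file: "<label> <index>:<value> ...", with 1-based indices.
val path = Files.createTempFile("sample", ".libsvm")
Files.write(path, "1 1:0.5 3:1.2\n0 2:2.0 3:0.1\n".getBytes(StandardCharsets.UTF_8))

val data = spark.read.format("libsvm").load(path.toString)
data.printSchema()

// A quick guard before configuring VectorIndexer with setInputCol("features").
val hasFeatures = data.columns.contains("features")
println(s"has a 'features' column: $hasFeatures")
```

A DataFrame read from CSV or built from plain columns fails the same check, which matches the `Field "features" does not exist` error in the original question.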

