
Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2014/10/24 22:34:34 UTC

[jira] [Created] (SPARK-4081) Categorical feature indexing

Joseph K. Bradley created SPARK-4081:
----------------------------------------

             Summary: Categorical feature indexing
                 Key: SPARK-4081
                 URL: https://issues.apache.org/jira/browse/SPARK-4081
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
    Affects Versions: 1.1.0
            Reporter: Joseph K. Bradley
            Priority: Minor


DecisionTree and RandomForest require categorical features and labels to be indexed as 0, 1, 2, ....  There is currently no code to help index a dataset.  This issue proposes a helper class that computes these indices and also decides which features to treat as categorical.

Proposed functionality:
* The helper converts a dataset of vectors with unknown feature types into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based on a maxCategories parameter.
* The helper can also map categorical feature values to 0-based indices.

Usage:
{code}
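// Two datasets of feature vectors whose feature types are not yet known.
val myData1: RDD[Vector] = ...
val myData2: RDD[Vector] = ...
// maxCategories controls which features are treated as categorical.
val datasetIndexer = new DatasetIndexer(maxCategories)
// Fit on the first dataset, then index it.
datasetIndexer.fit(myData1)
val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
// Fit on the second dataset, then index it.
datasetIndexer.fit(myData2)
val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
// Categorical feature info, e.g. to pass on to DecisionTree or RandomForest.
val categoricalFeaturesInfo: Map[Int, Int] = datasetIndexer.getCategoricalFeaturesInfo()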
{code}
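
To make the proposed behavior concrete, here is a minimal sketch in plain Scala, with no Spark dependency. It is not the proposed implementation: the class name SimpleDatasetIndexer is hypothetical, Seq[Array[Double]] stands in for RDD[Vector], and the decision rule (a feature is categorical when it has at most maxCategories distinct values) is an assumption, since the proposal only says the choice is based on maxCategories. Assigning indices in ascending order of the raw values is likewise an illustrative choice.

```scala
import scala.collection.mutable

// Hypothetical local stand-in for the proposed DatasetIndexer.
class SimpleDatasetIndexer(maxCategories: Int) {
  // Distinct values seen so far for each feature, keyed by feature index.
  private val seen = mutable.Map.empty[Int, mutable.Set[Double]]

  /** Accumulate distinct feature values; may be called on several datasets. */
  def fit(data: Seq[Array[Double]]): Unit =
    for (row <- data; (value, j) <- row.zipWithIndex) {
      val values = seen.getOrElseUpdate(j, mutable.Set.empty[Double])
      // Stop tracking once a feature is clearly continuous, to bound memory.
      if (values.size <= maxCategories) values += value
    }

  // Assumed rule: categorical means at most maxCategories distinct values.
  // Each categorical feature gets a map from raw value to 0-based index.
  private def categoryMaps: Map[Int, Map[Double, Int]] =
    seen.collect {
      case (j, values) if values.size <= maxCategories =>
        j -> values.toSeq.sortWith(_ < _).zipWithIndex.toMap
    }.toMap

  /** Replace categorical values with 0-based indices; keep continuous values.
    * Throws NoSuchElementException for a categorical value not seen in fit. */
  def transform(data: Seq[Array[Double]]): Seq[Array[Double]] = {
    val maps = categoryMaps
    data.map(_.zipWithIndex.map { case (value, j) =>
      maps.get(j).map(_(value).toDouble).getOrElse(value)
    })
  }

  /** Categorical feature index -> number of categories. */
  def getCategoricalFeaturesInfo: Map[Int, Int] =
    categoryMaps.map { case (j, m) => j -> m.size }
}
```

In this sketch, calling fit a second time adds to the values already seen, so a feature can move from categorical to continuous after more data is fitted. Whether the proposed class should behave that way is left open by the description above.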



